# on_the_convergence_of_blackbox_variational_inference__f09888a3.pdf

On the Convergence of Black-Box Variational Inference

Kyurae Kim University of Pennsylvania kyrkim@seas.upenn.edu

Jisu Oh North Carolina State University joh26@ncsu.edu

Kaiwen Wu University of Pennsylvania kaiwenwu@seas.upenn.edu

Yi-An Ma University of California, San Diego yianma@ucsd.edu

Jacob R. Gardner University of Pennsylvania jacobrg@seas.upenn.edu

We provide the first convergence guarantee for black-box variational inference (BBVI) with the reparameterization gradient. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Notably, our analysis reveals that certain algorithm design choices commonly employed in practice, such as nonlinear parameterizations of the scale matrix, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations and thus achieves the strongest known convergence guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.

1 Introduction

Despite the practical success of black-box variational inference (BBVI Kucukelbir et al., 2017 Ranganath et al., 2014 Titsias & Lázaro-Gredilla, 2014), also known as stochastic gradient variational Bayes and Monte Carlo variational inference, whether it converges under appropriate assumptions on the target problem have been an open problem for a decade. While our understanding of BBVI has been advancing (Bhatia et al., 2022 Challis & Barber, 2013 Domke, 2019, 2020 Hoffman & Ma, 2020), a full convergence guarantee that extends to the practical implementations as used in probabilistic programming languages (PPL) such as Stan (Carpenter et al., 2017), Turing (Ge et al., 2018), Tensorflow Probability (Dillon et al., 2017), Pyro (Bingham et al., 2019), and Py MC (Patil et al., 2010) has yet to be demonstrated.

Due to our lack of understanding, a consensus on how we should implement our BBVI algorithms has yet to be achieved. For example, when the variational family is chosen to be the location-scale family, the scale matrix can be parameterized linearly or nonlinearly, and both parameterizations are used by default in popular software packages. (See Table 1 in Kim et al. 2023.) Surprisingly, as we will show, seemingly innocuous design choices like these can substantially impact the convergence of BBVI. This is critical as BBVI has been shown to be less robust (e.g., sensitive to initial points, stepsizes, and such) than competing inference methods such as Markov chain Monte Carlo (MCMC). (See Dhaka et al., 2020 Domke, 2020 Welandawe et al., 2022 Yao et al., 2018.) Instead, the evaluation of BBVI algorithms has been relying on expensive empirical evaluations (Agrawal et al., 2020 Dhaka et al., 2021 Giordano et al., 2018 Yao et al., 2018).

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

To rigorously analyze the design of BBVI algorithms, we establish the first convergence guarantee for the implementations precisely as used in practice. We provide results for BBVI with the reparameterization gradient (RP Kingma & Welling, 2014 Titsias & Lázaro-Gredilla, 2014) and the location-scale variational family, arguably the most widely used combination in practice. Our results apply to log-smooth posteriors, which is a routine assumption for analyzing the convergence of stochastic optimization (Garrigos & Gower, 2023) and sampling algorithms (Dwivedi et al., 2019, 2.3). The key is to show that evidence lower bound (ELBO Jordan et al., 1999) satisfies regularity conditions required by convergence proofs of stochastic gradient descent (SGD Bottou, 1999 Nemirovski et al., 2009 Robbins & Monro, 1951), the workhorse underlying BBVI.

Our analysis reveals that nonlinear scale matrix parameterizations used in practice are suboptimal: they provably break strong convexity and sometimes even convexity. Even if the posterior is strongly log-concave, the ELBO is not strongly convex anymore. This contrasts with linear parameterizations, which guarantee the ELBO to be strongly convex if the posterior is strongly log-concave (Domke, 2020). Under linear parameterizations, however, the ELBO is no longer smooth, making optimization challenging. Because of this, Domke (2020) proposed to use proximal SGD, which Agrawal & Domke (2021, Appendix A) report to have better performance than vanilla SGD with nonlinear parameterizations. Indeed, we show that BBVI with proximal SGD achieves the fastest known converges rates of SGD, unlike vanilla BBVI. Thus, we provide a concrete reason for employing proximal SGD. We evaluate this insight on large-scale Bayesian inference problems by implementing an Adam-like (Kingma & Ba, 2015) variant of proximal SGD proposed by Yun et al. (2021).

Concurrently to this work, convergence guarantees on BBVI with the RP and the sticking-the-landing estimator (STL Roeder et al., 2017) under the linear parameterization were published by Domke et al. (2023). To achieve this, they show that a quadratic bound on the gradient variance is sufficient to guarantee the convergence of projected and proximal SGD. In contrast, we focus on analyzing the ELBO under nonlinear parameterizations and connect it to existing analysis strategies. A more in-depth comparison of the two works is provided in Appendix E.

Convergence Guarantee for BBVI: Theorem 3 establishes a convergence guarantee for BBVI with assumptions matching the implementations used in practice. That is, without algorithmic simplifications and unrealistic assumptions such as bounded domain or bounded support. Optimality of Linear Parameterizations: Theorem 2 shows that, for location-scale variational families, nonlinear scale parameterizations prevent the ELBO from being strongly-convex even when the target posterior is strongly log-concave. Convergence Guarantee for Proximal BBVI: Theorem 4 guarantees that, if proximal SGD is used, BBVI on 𝜇-strongly log-concave posteriors can obtain a solution 𝜖-close to the global optimum with 𝒪(1/𝜖) iterations. Evaluation of Proximal BBVI in Practice: In Section 5, we evaluate the utility of proximal SGD on large-scale Bayesian inference problems.

2 Background

Notation Random variables are denoted in serif (e.g., 𝘹, 𝙭), vectors are in bold (e.g., 𝒙, 𝙭), and matrices are in bold capitals (e.g. 𝑨). For a vector 𝒙 ℝ𝑑, we denote the inner product as 𝒙 𝒙and 𝒙, 𝒙 , the ℓ2-norm as 𝒙 2 = 𝒙 𝒙. For a matrix 𝑨, 𝑨 F = tr (𝑨 𝑨) denotes the Frobenius norm. 𝕊𝑑 ++ is the set of positive definite matrices. For some function 𝑓, D𝑖𝑓denotes the 𝑖th coordinate of 𝑓, and C𝑘(𝒳, 𝒴) is the set of 𝑘-time differentiable continuous functions mapping from 𝒳to 𝒴.

2.1 Black-Box Variational Inference

Variational inference (VI, Blei et al., 2017 Jordan et al., 1999 Zhang et al., 2019) aims to minimize the exclusive (or backward/reverse) Kullback-Leibler (KL) divergence as:

minimize 𝝀 Λ DKL (𝑞𝝀, 𝜋) 𝔼𝙯 𝑞𝝀 log 𝜋(𝙯) ℍ(𝑞𝝀) ,

where DKL (𝑞𝝀, 𝜋) is the KL divergence, ℍ is the differential entropy, 𝜋 is the (target) posterior distribution, and 𝑞𝝀 is the variational distribution,

While alternative approaches to VI (Dieng et al., 2017 Hernandez-Lobato et al., 2016 Kim et al., 2022 Naesseth et al., 2020) exist, so far, exclusive KL minimization has been the most successful. We thus use exclusive KL minimization as a synonym for VI, following convention.

Equivalently, one minimizes the negative evidence lower bound (ELBO, Jordan et al., 1999) 𝐹:

minimize 𝝀 Λ 𝐹(𝝀) 𝔼𝙯 𝑞𝝀 log 𝑝(𝒛, 𝒙) ℍ(𝑞𝝀) ,

where log 𝑝(𝒛, 𝒙) is the joint likelihood, which is proportional to the posterior as 𝜋(𝒛) 𝑝(𝒛, 𝒙) = 𝑝(𝒙 𝒛) 𝑝(𝒛), where 𝑝(𝒙 𝒛) is the likelihood and 𝑝(𝒛) is the prior.

2.2 Variational Family

In this work, we focus on the following variational family. ( d= is equivalence in distribution.)

Definition 1 (Reparameterized Family). Let 𝜑be some 𝑑-variate distribution. Then, 𝑞𝝀that can be equivalently represented as

𝙯 𝑞𝝀 𝙯 d= 𝒯𝝀(𝙪) ; 𝙪 𝜑, is said to be part of a reparameterized family generated by the base distribution 𝜑and the reparameterization function 𝒯𝝀.

Definition 2 (Location-Scale Reparameterization Function). 𝒯𝝀 ℝ𝑑 ℝ𝑑defined as

𝒯𝝀(𝒖) 𝑪𝒖+ 𝒎 with 𝝀containing the parameters for forming the location 𝒎 ℝ𝑑and scale 𝑪= 𝑪(𝝀) ℝ𝑑 𝑑is called the location-scale reparameterization function. The location-scale family enables detailed theoretical analysis, as demonstrated by (Domke, 2019, 2020 Fujisawa & Sato, 2021 Kim et al., 2023), and includes the most widely used variational families such as the Student-t, elliptical, and Gaussian families (Titsias & Lázaro-Gredilla, 2014).

Handling Constrained Support For common choices of the base distribution 𝜑, the support of 𝑞𝝀is the whole ℝ𝑑. Therefore, special treatment is needed when the support of 𝜋is constrained. Kucukelbir et al. (2017) proposed to handle this by applying diffeomorphic transformation denoted with 𝜓, often called bjectors (Dillon et al., 2017 Fjelde et al., 2020 Leger, 2023), to 𝑞𝝀such that

𝞯 𝑞𝜓,𝝀 𝞯 𝑑= 𝜓 1(𝙯); 𝙯 𝑞𝝀,

such that the support of 𝑞𝜓,𝝀matches that of 𝜋. For example, when the support of 𝜋is ℝ+, one can choose 𝜓 1 = exp. This approach, known as automatic differentiation VI (ADVI), is now standard in most modern PPLs.

Why focus on posteriors with unconstrained supports? When bijectors are used, the entropy of 𝑞𝝀, ℍ(𝑞𝝀), needs to be adjusted by the Jacobian of 𝜓(Kucukelbir et al., 2017), 𝑱𝜙 1. However, applying the transformation to 𝜋instead of 𝑞𝝀is mathematically equivalent and more convenient. In fact, bijectors can be automatically incorporated into our notation by implicitly setting

𝑝(𝒙 𝒛) = 𝑝(𝒙 𝜓 1 (𝒛)) and 𝑝(𝒛) = 𝑝(𝜓 1 (𝒛)) ||𝐉𝜓 1 (𝒛)||,

such that 𝜋(𝜻) 𝑝(𝒙 𝜻) 𝑝(𝜻), where 𝜋is the constrained posterior that we are actually interested in. Therefore, our setup in Section 2.1, where the domain of 𝙯is taken to be the unconstrained ℝ𝑑, already encompasses constrained posteriors through ADVI.

Lastly, we impose light assumptions on the base distribution 𝜑, which are already satisfied by most variational families used in practice. (i.i.d.: independently and identically distributed.)

Assumption 1 (Base Distribution). 𝜑is a 𝑑-variate distribution such that 𝙪 𝜑and 𝙪= (𝘶1, , 𝘶𝑑) with i.i.d. components. Furthermore, 𝜑is (i) symmetric and standardized such that 𝔼𝘶𝑖= 0, 𝔼𝘶2 𝑖= 1, 𝔼𝘶3 𝑖= 0, and (ii) has finite kurtosis 𝔼𝘶4 𝑖= 𝑘𝜑< .

The assumptions on the variational family we will use throughout this work are collectively summarized in the following assumption:

Assumption 2. The variational family is the location-scale family formed by Definitions 1 and 2 with the base distribution 𝜑satisfying Assumption 1.

2.3 Scale Parameterizations

For the scale matrix 𝑪(𝝀) in the location-scale family, any parameterization that results in a positive-definite covariance 𝑪𝑪 𝕊𝑑 ++ is valid. However, for the ELBO to ever be convex, the entropy ℍ(𝑞𝝀) must be convex, which requires the mapping 𝝀 𝑪𝑪 to be convex. To ensure this, we restrict 𝑪to (lower) triangular matrices with strictly positive eigenvalues, essentially, Cholesky factors. This leaves two of the most common parameterizations:

Definition 3 (Mean-Field Family.).

𝑪= 𝑫𝜙(𝒔) where the 𝑑elements of 𝒔forms the diagonal and 𝝀 Λ such that

Λ = {(𝒎, 𝒔) 𝒎 ℝ𝑑, 𝒔 𝒮}.

Definition 4 (Full-Rank Cholesky Family).

𝑪= 𝑫𝜙(𝒔) + 𝑳, where the 𝑑elements of 𝒔forms the diagonal, 𝑳is 𝑑by-𝑑strictly lower triangular, and 𝝀 Λ such that

Λ = {(𝒎, 𝒔, 𝑳) 𝒎 ℝ𝑑, 𝒔 𝒮, vec (𝑳) ℝ(𝑑+1)𝑑/2} .

Here, 𝑆is discussed in the next paragraph, 𝑫𝜙(𝒔) ℝ𝑑 𝑑is a diagonal matrix such that 𝑫𝜙(𝒔) diag (𝝓(𝒔)) = diag (𝜙(𝑠1) , , 𝜙(𝑠𝑑)), and 𝜙is a function we call a diagonal conditioner.

Linear v.s. Nonlinear Parameterizations When the diagonal conditioner is a linear function 𝜙(𝑥) = 𝑥, we say that the covariance parameterization is linear. In this case, to ensure that 𝑪is a Cholesky factor, the domain of 𝒔is set as 𝒮= ℝ𝑑 +. On the other hand, by choosing a nonlinear conditioner 𝜙 ℝ ℝ+, we can make the domain of 𝒔to be the unconstrained 𝒮= ℝ𝑑. Because of this, nonlinear conditioners such as the softplus (𝑥) log (1 + exp (𝑥)) (Dugas et al., 2000) are frequently used in practice, especially for mean-field. (See Table 1 by Kim et al., 2023).

2.4 Problem Structure of Black-Box Variational Inference

Exclusive KL minimization VI is fundamentally a composite (regularized) optimization problem

𝐹(𝝀) = 𝑓(𝝀) + ℎ(𝝀) , (ELBO)

where 𝑓(𝝀) 𝔼𝙯 𝑞𝝀ℓ(𝙯) is the energy term, ℓ(𝒛) log 𝑝(𝒛, 𝒙) is the negative joint log-likelihood, and ℎ(𝝀) ℍ(𝑞𝜓,𝝀) is the entropic regularizer. From here, BBVI introduces more structure.

CP Composite IS Infinite Sum RP Reparameterized FS Finite Sum ERM Empirical Risk Minimization

Figure 1: Taxonomy of variational inference. Within BBVI, this work only considers the reparameterization gradient (BBVI RP, shown in dark red). This leaves out BBVI with the score gradient (BBVI RP, shown in light red). The set VI FS includes sparse variational Gaussian processes (Titsias, 2009), while the remaining set VI (FS IS RP) includes coordinate ascent VI (Blei et al., 2017).

An illustration of the taxonomy is shown in Figure 1. In particular, BBVI has an infinite sum structure (IS). That is, it cannot be represented as a sum of finite subcomponents as in ERM. Furthermore,

𝐹(𝝀) = 𝔼𝙪 𝜑𝑓(𝝀; 𝙪) + ℎ(𝝀) (CP IS)

= 𝔼𝙪 𝜑ℓ(𝒯𝝀(𝙪)) + ℎ(𝝀) , (CP IS RP)

where 𝑓(𝝀; 𝒖) ℓ(𝒯𝝀(𝒖)).

Theoretical Challenges The structure of BBVI has multiple challenges that have hindered its theoretical analysis: (i) the stochasticity of the Jacobian of 𝒯and (ii) The infinite sum structure.

For Item (i), we can see that in

𝝀ℓ(𝒯𝝀(𝒖)) = 𝒯𝝀(𝒖)

𝝀 ℓ(𝒯𝝀(𝒖)) = 𝒯𝝀(𝒖)

𝝀 𝑔(𝝀; 𝒖) ,

where 𝑔(𝝀; 𝒖) ( ℓ 𝒯𝝀) (𝙪), both the Jacobian of 𝒯𝝀 and the gradient of the log-likelihood, 𝑔, depend on the randomness 𝒖. Effectively decoupling the two is a major challenge to analyzing the properties of the ELBO and its gradient estimators (Domke, 2019, 2020).

For Item (ii), the problem is that recent analyses of SGD (Garrigos & Gower, 2023 Gower et al., 2019 Nguyen et al., 2018 Vaswani et al., 2019) have increasingly been relying on the assumption that 𝑓(𝝀; 𝒖) is smooth for all 𝒖such that 𝝀𝑓(𝝀; 𝒖) 𝝀𝑓(𝝀 ; 𝒖) 𝐿 𝝀 𝝀 for some 𝐿< . This is sensible if the support of 𝙪is bounded, which is true for the ERM setting but not for the class of infinite sum (IS) problems. Previous works circumvented this issue by assuming (i) that the support of 𝙪is bounded (Fujisawa & Sato, 2021) which implicitly changes the variational family, or (ii) that the gradient 𝑓is bounded by a constant (Buchholz et al., 2018 Liu & Owen, 2021) which contradicts strong convexity (Nguyen et al., 2018).

3 The Evidence Lower Bound Under Nonlinear Scale Parameterizations

Under the linear parameterization (𝜙(𝑥) = 𝑥), the properties of the ELBO, such as smoothness and convexity, have been previously analyzed by Challis & Barber (2013) Domke (2020) Titsias & Lázaro-Gredilla (2014). We generalize these results to nonlinear conditioners.

3.1 Technical Assumptions Let 𝑔𝑖(𝝀; 𝙪) be the 𝑖th coordinate of 𝙜(𝝀; 𝙪) and recall that 𝘶𝑖denote the 𝑖th element of 𝙪. Establishing convexity and smoothness of the ELBO under nonlinear parameterizations depends on a pair of necessary and sufficient assumptions. To establish smoothness:

Assumption 3. The gradient of ℓunder reparameterization, 𝑔, satisfies |𝔼𝑔𝑖(𝝀; 𝙪) 𝘶𝑖𝜙 (𝑠𝑖)| 𝐿𝑠 for every coordinate 𝑖= 1, 𝑑, any 𝝀 Λ, and some 0 < 𝐿𝑠< . Here, 𝜙 is the second derivative of 𝜙. The next one is required to establish convexity:

Assumption 4. The gradient of ℓunder reparameterization, 𝑔, satisfies 𝔼𝑔𝑖(𝝀; 𝙪) 𝘶𝑖 0 for every coordinate 𝑖= 1, 𝑑. Intuitively, these assumption control how much ℓand 𝒯𝝀rotate the randomness 𝙪. (Notice that the assumptions are closely related to the matrix Cov (𝑔(𝝀; 𝙪) , 𝙪), the covariance between 𝑔and 𝙪.) However, the peculiar aspect of these assumptions is that they are not implied by the convexity and smoothness of ℓ. Especially, Assumption 3 strongly depends on the internals of ℓ.

3.2 Smoothness of the Entropy Under the linear parameterization, Domke (2020) has previously shown that the entropic regularizer term ℎis not smooth. This fact immediately implies the ELBO is not smooth. However, certain nonlinear conditioners do result in a smooth regularizer.

Lemma 1. If the diagonal conditioner 𝜙is 𝐿ℎ-log-smooth, then the entropic regularizer ℎ(𝝀) is 𝐿ℎ-smooth.

Proof. See the full proof in page 24. Example 1. The following diagonal conditioners result in a smooth entropic regularizer:

1. Let 𝜙(𝑥) = softplus (𝑥). Then, ℎis 𝐿ℎ-smooth with 𝐿ℎ 0.167096. 2. Let 𝜙(𝑥) = exp (𝑥). Then, ℎis 𝐿ℎ-smooth for arbitrarily small 𝐿ℎ. This might initially suggest that diagonal conditioners are a promising way of making the ELBO globally smooth. Unfortunately, the properties of the energy, 𝑓, change unfavorably.

3.3 Smoothness of the Energy Inapplicability of Existing Proof Strategy Previously, Domke (2020, Theorem 1) have proven that the energy is smooth when 𝜙is linear. The key step was to use Bessel s inequality based on the observation that the partial derivatives of the reparameterization function 𝒯form unit bases in expectation. That is,

where 𝟙𝑖=𝑗is an indicator function that is 1 only when 𝑖= 𝑗and 0 otherwise.

Unfortunately, when 𝜙is nonlinear, the partial derivatives 𝒯𝝀(𝙪)/ 𝜆𝑖for 𝑖= 1, , 𝑝no longer form unit bases: while they are still orthogonal in expectation, the lengths change nonlinearly depending on 𝝀. This leaves Bessel s inequality inapplicable. To circumvent this challenge, we establish a replacement for Bessel s inequality:

Lemma 2. Let 𝙃be a 𝑛 𝑛symmetric random matrix, where it is bounded as 𝙃 2 𝐿< almost surely. Also, let 𝙅be an 𝑚 𝑛random matrix such that 𝔼𝙅 𝙅 2 < . Then,

𝔼𝙅 𝙃𝙅 2 𝐿 𝔼𝙅 𝙅 2.

Proof. See the full proof in page 24. Remark 1. By assuming that the joint log-likelihood ℓis smooth and twice-differentiable, we retrieve Theorem 1 of Domke (2020) by setting 𝙅to be the Jacobian of 𝒯, and 𝙃to be the Hessian of ℓunder reparameterization.

Remark 2. While our reparameterization function s partial derivatives still form orthogonal bases, they need not be unlike Bessel s inequality, Lemma 2 does not require this. This implies that Lemma 2 is a strategy more general than Bessel s inequality. Equipped with Lemma 2, we present our main result on smoothness:

Theorem 1. Let ℓbe 𝐿ℓ-smooth and twice differentiable. Then, the following results hold:

(i) If 𝜙is linear, the energy 𝑓is 𝐿ℓ-smooth. (ii) If 𝜙is 1-Lipschitz, the energy ℓis (𝐿ℓ+ 𝐿𝑠)-smooth if and only if Assumption 3 holds.

Proof. See the full proof in page 27.

Combined with Lemma 1, this directly implies that the overall ELBO is smooth.

Corollary 1 (Smoothness of the ELBO). Let ℓbe 𝐿ℓ-smooth and Assumption 3 hold. Furthermore, let the diagonal conditioner be 1-Lipschitz continuous, and 𝐿𝜙-log-smooth. Then, the ELBO is (𝐿ℓ+ 𝐿𝑠+ 𝐿𝜙)-smooth.

The increase of the smoothness constant implies that we need to use a smaller stepsize to guarantee convergence when using a nonlinear 𝜙. Furthermore, even on simple 𝐿-smooth examples Assumption 3 may not hold: Example 2. Let ℓ(𝒛) = (1/2) 𝒛 𝑨𝒛and the diagonal conditioner be 𝜙(𝑥) = softplus (𝑥). Then,

(i) if 𝑨is dense and the variational family is the mean-field family or (ii) if 𝑨is diagonal and the variational family is the Cholesky family,

Assumption 3 holds with 𝐿𝑠 0.26034 (max𝑖=1, ,𝑑𝐴𝑖𝑖).

(iii) If 𝑨is dense but the Cholesky family is used, Assumption 3 does not hold.

Proof. See the full proof in page 29.

ELBO with 𝜙(𝑥) = softplus(𝑥) ELBO with 𝜙(𝑥) = 𝑥 Lower-Bounding Quadratic Tangent Line at 𝑠1 = 5

Figure 2: Optimization landscape resulting from different 𝜙on a strongly-convex ℓ. ℓis the counter-example of Proposition 1 Item (ii). 𝜙(𝑥) = 𝑥preserves strong convexity as shown by the lower-bounding quadratic (red dotted line ). 𝜙= softplus violates the first-order condition of convexity (black dotted line ).

Example 2 illustrates that establishing the smoothness of the energy becomes non-trivial under nonlinear parameterizations. Even when smoothness does hold, the increased smoothness constant implies that BBVI will be less robust to initialization and stepsizes. Furthermore, in the next section, we will show a much more grave problem: nonlinear parameterizations may affect the convergence rate.

3.4 Convexity of the Energy

The convexity of the ELBO under linear parameterizations has first been established by Titsias & Lázaro Gredilla (2014, Proposition 1) and Domke (2020, Theorem 9). In particular, Domke (2020) show that, when 𝜙is linear, if ℓis 𝜇-strongly convex, the energy is also 𝜇-strongly convex. However, when using a nonlinear 𝜙with a co-domain of ℝ+, which is the whole point of using a nonlinear conditioner, strong convexity of ℓnever transfers to 𝑓.

Theorem 2. Let ℓbe 𝜇-strongly convex. Then, we have the following:

(i) If 𝜙is linear, the energy 𝑓is 𝜇-strongly convex. (ii) If 𝜙is convex, the energy 𝑓is convex if and only if Assumption 4 holds. (iii) If 𝜙is such that 𝜙 C1 (ℝ, ℝ+), the energy 𝑓is not strongly convex.

Proof. See the full proof in page 33. The following proposition provides some conditions for Assumption 4 to hold or not hold.

Proposition 1. We have the following:

(i) If ℓis convex, then for the mean-field family, Assumption 4 holds. (ii) For the Cholesky family, there exists a convex ℓwhere Assumption 4 does not hold. Proof. See the full proof in page 31.

For any continuous, differentiable nonlinear conditioner that maps only to non-negative reals, the strong convexity of ℓdoes lead to a strongly-convex ELBO. This phenomenon is visualized in Figure 2. The loss surface becomes flat near the optimal scale parameter. This problem becomes more noticeable as the optimal scale becomes smaller.

Nonlinear conditioners are suboptimal. As the dataset grows, Bayesian posteriors are known to contract as characterized by the Bernstein-von Mises theorem (van der Vaart, 1998). That is, the posterior variance becomes close to 0. This behavior also applies to misspecified variational posteriors as shown by Wang & Blei (2019). Thus, for large datasets, nonlinear conditioners mostly operate in the regime where they are suboptimal (locally less strongly convex). But linear conditioners result in a non-smooth entropy (Domke, 2020). This dilemma originally motivated Domke to consider proximal SGD, which we analyze in Section 4.2.

4 Convergence Analysis of Black-Box Variational Inference

4.1 Black-Box Variational Inference

BBVI with SGD repeats the steps:

𝝀𝑡+1 = 𝝀𝑡 𝛾𝑡( ˆ 𝑓(𝝀𝑡) + ℎ(𝝀𝑡)) , where ˆ 𝑓(𝝀𝑡) = 1

𝑚=1 𝝀ℓ(𝒯𝝀(𝙪𝑚)) (1)

with 𝙪𝑚 𝜑is the 𝑀-sample reparameterization gradient estimator and 𝛾𝑡is the stepsize. (See Kucukelbir et al., 2017 for algorithmic details.)

With our results in Section 3 and the results of Khaled & Richtárik (2023) Kim et al. (2023), we obtain a convergence guarantee. To apply the result of Kim et al. (2023), which bounds the gradient variance, we require an additional assumption.

Assumption 5. The negative log-likelihood ℓlike(𝒛) log 𝑝(𝒙 𝒛) is 𝜇-quadratically growing for all 𝒛 ℝ𝑑such that 𝜇 2 𝒛 𝒛like 2 2 ℓlike (𝒛) ℓ like, where 𝒛like is the projection of 𝒛to the set of minimizers of ℓlike, and ℓ like = inf𝒛 ℝ𝑑ℓlike (𝒛).

This assumption is weaker than assuming that the likelihood satisfies the Polyak-Łojasiewicz inequality (Karimi et al., 2016).

Theorem 3. Let Assumption 2 hold, the likelihood satisfy Assumption 5, and the assumptions of Corollary 1 hold such that the ELBO 𝐹is 𝐿𝐹-smooth with 𝐿𝐹= 𝐿ℓ+ 𝐿𝜙+ 𝐿𝑠. Then, the iterates generated by BBVI through Equation (1) and the 𝑀-sample reparameterization gradient include an 𝜖-stationary point such that min0 𝑡 𝑇 1 𝔼 𝐹(𝝀𝑡) 2 𝜖for any 𝜖> 0 if

𝑇 𝒪( (𝐹(𝝀0) 𝐹 ) 2𝐿𝐹𝐿2 ℓ𝐶(𝑑, 𝑘𝜑) 𝜇𝑀𝜖4 )

for some fixed stepsize 𝛾, where 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family and 𝐶(𝑑, 𝜑) =

2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. See the full proof in page 35.

Remark 3. Finding an 𝜖-stationary point of the ELBO has an iteration complexity of 𝒪(𝑑𝐿2 ℓ𝜅𝑀 1𝜖 4) for the Cholesky family and 𝒪( 𝑑𝐿2 ℓ𝜅𝑀 1𝜖 4) for the mean-field family.

4.2 Black-Box Variational Inference with Proximal SGD

Proximal SGD For a composite objective 𝐹= 𝑓+ ℎ, proximal SGD repeats the steps:

𝝀𝑡+1 = prox𝛾𝑡,ℎ(𝝀𝑡 𝛾𝑡ˆ 𝑓(𝝀𝑡)) = arg min 𝝀 Λ [ ˆ 𝑓(𝝀𝑡) , 𝝀 + ℎ(𝝀) + 1

2𝛾𝑡 𝝀 𝝀𝑡 2 2 ] , (2)

where prox is known as the proximal operator and 𝛾1, , 𝛾𝑇is a stepsize schedule.

In the context of VI, proximal SGD has previously been considered by Altosaar et al. (2018) Diao et al. (2023) Khan et al. (2016, 2015). Their overall focus has been on developing alternative algorithms by generalizing 𝝀 𝝀 to other metrics. In contrast, Domke (2020) considered proximal SGD with the regular Euclidean metric 𝝀 𝝀 2 for overcoming the non-smoothness of ℎunder

linear parameterizations. Here, we prove the convergence of this scheme and show that it retrieves the fastest known convergence rates in stochastic first-order optimization.

Proximal Operator for BBVI In our context, ℎis the entropy of 𝑞𝝀in the location-scale family. For this, Domke (2020) show that the the proximal update for 𝑠1, , 𝑠𝑑, is

prox𝛾𝑡,ℎ(𝑠𝑖) = 𝑠𝑖+ 1

2 ( 𝑠2 𝑖+ 4𝛾𝑡 𝑠𝑖) .

For other parameters, the proximal operator is the regular gradient descent update in Equation (1).

Gradient Variance Bound We first establish a bound on the gradient variance. In ERM, contemporary strategies do this by exploiting the finite sum structure of the objective (Section 2.4). Here, we establish a variance bound for RP estimator that does not rely on the finite sum assumption.

Lemma 3 (Convex Expected Smoothness). Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex with the variational family satisfying Assumption 2 with the linear parameterization. Then,

𝔼 𝝀𝑓(𝝀; 𝙪) 𝝀 𝑓(𝝀 ; 𝙪) 2 2 2𝐿ℓ𝜅𝐶(𝑑, 𝜑) B𝑓(𝝀, 𝝀 )

holds, where B𝑓(𝝀, 𝝀 ) 𝑓(𝝀) 𝑓(𝝀 ) 𝑓(𝝀 ) , 𝝀 𝝀 is the Bregman divergence, 𝜅= 𝐿ℓ/𝜇

is the condition number, 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family, and 𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family. Proof. See the full proof in page 36. Furthermore, the gradient variance at the optimum must be bounded:

Lemma 4 (Domke, 2019 Kim et al., 2023). Let ℓbe 𝐿ℓ-smooth with the variational family satisfying Assumption 2 and a 1-Lipschitz diagonal conditioner 𝜙. Then, the gradient variance at the optimum 𝝀 arg min𝝀 Λ 𝐹(𝝀) is bounded as

𝑀𝐶(𝑑, 𝜑) 𝐿2 ℓ( 𝒛 𝒎 2 2 + 𝑪 2 F) ,

where 𝒛is a stationary point of ℓ, 𝒎 and 𝑪 are the location and scale formed by 𝝀 , the constants are 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family and 𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family, 𝑘𝜑is the kurtosis of 𝜑as defined in Assumption 1. Proof. The full-rank case is proven by Domke (2019, Theorem 3), while the mean-field case is a basic corollary of the result by Kim et al. (2023, Lemma 2). Remark 4. The dimensional dependence in the complexity of BBVI is transferred from the variance bound in Lemma 4. Unfortunately, for the Cholesky family, this dimensional dependence in the variance bound is tight (Domke, 2019).

Main Result With the gradient variance bounds, we now present our complexity result. The proof is identical to Theorem 3.2 by Gower et al. (2019), where they use a 2-stage decreasing stepsize schedule: the stepsize is initially held constant and then reduced in a 1/𝑡rate.

Theorem 4. Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex. Then, for any 𝜖> 0, BBVI with proximal SGD in Equation (2), the 𝑀-sample reparameterization gradient estimator, a variational family satisfying Assumption 2 with the linear parameterization guarantees 𝔼 𝝀𝑇 𝝀 2 2 𝜖if

2𝐿ℓ𝜅𝐶(𝑑,𝜑) for 𝑡 4𝑇𝜅 2𝑡+1

(𝑡+1)2𝜇 for 𝑡> 4𝑇𝜅, 𝑇 max ( 8𝜎2

𝜇2 𝜖+ 4𝑇𝜅 𝝀0 𝝀 2

where 𝜎2 is defined in Lemma 4, 𝑇𝜅= 𝜅2𝐶(𝑑, 𝜑) 𝑀 1 , 𝜅= 𝐿ℓ/𝜇is the condition number, e is Euler s constant, 𝝀 = arg min𝝀 Λ 𝐹(𝝀), 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family, and

𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. See the full proof in page 38.

Remark 5. BBVI with proximal SGD on 𝜇-strongly convex and 𝐿ℓ-smooth ℓhas a complexity 𝒪(𝜅2𝑑𝑀 1 𝜖 1) for the Cholesky family and 𝒪(𝜅2 𝑑𝑀 1 𝜖 1) for the mean-field family.

Remark 6. We also provide a similar result with a fixed stepsize in Theorem 7 of Appendix F.3.2. In this case, the complexity is 𝒪(𝜅2𝑑𝑀 1𝜖 1 log 𝜖 1) for the Cholesky family and

𝒪(𝜅2 𝑑𝑀 1𝜖 1 log 𝜖 1) for the mean-field family.

10 4 10 3 10 2 10 1 101

𝑇for 𝜖-accuracy

Proximal SGD 𝜙(𝑥) = 𝑥 SGD 𝜙(𝑥) = 𝑥 SGD 𝜙(𝑥) = softplus(𝑥)

10 4 10 3 10 2 10 1

Stepsize (𝛾)

10 4 10 3 10 2 10 1

Figure 3: Stepsize versus the number of iterations for vanilla SGD and proximal SGD to achieve DKL(𝑞𝝀, 𝜋) 𝜖= 1 under different initializations for Gaussian posteriors. The initializations 𝐶(𝝀0) are 𝐈, 10 3𝐈, 10 5𝐈from left to right, respectively. The average suboptimality at iteration 𝑡 was estimated from 10 independent runs. For each run, the target posterior was a 10-dimensional Gaussian with a covariance with a condition number 𝜅= 10 and a smoothness of 𝐿= 100.

LME-election

Prox Gen-Adam ϕ(x) = x Adam ϕ(x) = x Adam ϕ(x) = softplus(x)

Iteration 1 2.5k 5k 7.5k 10k

1 5k 10k 15k 20k

1 10k 20k 30k

1 10k 20k 30k

Iteration 1 10k 20k 30k

LR-electric

1 5k 10k 15k 20k

Figure 4: Comparison of BBVI convergence speed (ELBO v.s. Iteration) of different optimization algorithms. The error bands are the 80% quantiles estimated from 20 (10 for AR-eeg) independent replications. The results shown used a base stepsize of 𝛾= 10 3, while the initial point was 𝒎0 = 𝟎, 𝑪0 = 𝐈. Details on the setup can be found in the text of Section 5.2 and Appendix G.

5 Experiments

5.1 Synthetic Problem

Setup We first compare proximal SGD against vanilla SGD with linear and nonlinear parameterizations on a synthetic problem, which is log-smooth, strongly log-concave, and the exact solution is known. While a similar experiment was already conducted by Domke (2020), here we include nonlinear parameterizations, which were not originally considered. We run all algorithms with a fixed stepsize to infer a multivariate Gaussian with a full-rank covariance matrix. The variational approximation is a full-rank Gaussian formed by 𝜑= 𝒩(0, 1) and the Cholesky parameterization.

Results The results are shown in Figure 3. Proximal SGD is clearly the most robust against initialization. Also, SGD with the nonlinear parameterization 𝜙(𝑥) = softplus(𝑥) is much slower to converge under all initializations. This confirms that linear parameterizations are indeed superior for both robustness against initializations and convergence speed.

5.2 Realistic Problems

Setup We now evaluate proximal SGD on realistic problems. In practice, Adam (Kingma & Ba, 2015) is observed to be robust against stepsize choices (Zhang et al., 2019). The reason why Adam performs well on non-smooth, non-convex problems is still under investigation (Kunstner et al., 2023 Reddi et al., 2023 Zhang et al., 2022). Nonetheless, to compare fairly against Adam, we implement a recently proposed variant of proximal SGD called Prox Gen (Yun et al., 2021), which

includes an Adam-like update rule. The probabilitic models and datasets are fully described in Appendix G. We implement these models and BBVI on top of the Turing (Ge et al., 2018) probabilistic programming framework. Due to the size of these datasets, we implement doubly stochastic subsampling (Titsias & Lázaro-Gredilla, 2014) with a batch size of 𝐵= 100 (𝐵= 500 for BT-tennis) with 𝑀= 10 Monte Carlo samples. For batch subsampling, we implement random-reshuffling, which is faster than independent subsampling both empirically (Bottou, 2009) and theoretically (Ahn et al., 2020 Haochen & Sra, 2019 Mishchenko et al., 2020 Nagaraj et al., 2019). We also observe that doubly stochastic BBVI benefits from reshuffling, but leave a detailed investigation to future works.

Results Representative results are shown in Figure 4, with additional results in Appendix H. Both Prox Gen-Adam and Adam with linear parameterizations converge faster than Adam with nonlinear parameterization. Furthermore, for the case of election and buzz, Adam with the nonlinear parameterization converges much slower than the alternatives. When using linear parameterizations, Prox Gen-Adam appears to be generally faster than Adam. We note, however, that due to the difference in the update rule between Prox Gen-Adam and Adam, proximal operators alone might not fully explain the performance difference. Nevertheless, the results of our experiment do conclusively suggest that linear parameterizations are superior.

6 Discussions

Conclusions In this work, we have proven the convergence of BBVI. Our assumptions encompass implementations that are actually used in practice, and our theoretical analysis revealed limitations in some of the popular design choices (mainly the use of nonlinear conditioners). To resolve this issue, we re-evaluated the utility of proximal SGD both theoretically and practically, where it achieved the strongest theoretical guarantees in stochastic first-order optimization.

Related Works To prove the convergence of BBVI, early works have a-priori assumed the regularity of the ELBO and the gradient estimator (Alquier & Ridgway, 2020 Buchholz et al., 2018 Khan et al., 2016, 2015 Liu & Owen, 2021 Regier et al., 2017). Towards a more rigorous understanding, Domke (2019) Fan et al. (2015) Kim et al. (2023) Xu et al. (2019) studied the reparameterization gradient, Xu & Campbell (2022) studied the asymptotics of the ELBO, Challis & Barber (2013) Domke (2020) Titsias & Lázaro-Gredilla (2014) established convexity, and Domke (2020) established smoothness. On the other hand, Bhatia et al. (2022) Hoffman & Ma (2020) established rigorous convergence guarantees by considering simplified variant of BBVI where only the scale is optimized, and Fujisawa & Sato (2021) assumed that the support of 𝜑is bounded almost surely. Meanwhile, under similar assumptions to ours, Diao et al. (2023) Lambert et al. (2022) recently established convergence guarantees for proximal SGD BBVI with a Bures-Wasserstein metric. Their computational properties differ from BBVI as they require Hessian evaluations. Also, understanding BBVI, which is VI with a Euclidean metric, is an important problem due to its practical relevance.

Limitations Our work has multiple limitations: (i) Our results are restricted to the location-scale family, (ii) the reparameterization gradient, and (iii) smooth joint log-likelihoods. However, the location-scale family with the reparameterization gradient is the most widely used combination in practice, and replacing the smoothness assumption is an active area of research in stochastic optimization. For our results on proximal SGD, we further assume that the joint log-likelihood is 𝜇-strongly convex (equivalently strongly log-concave posteriors). It is unclear how to extend the guarantees to only smooth but non-log-concave joint log-likelihoods.

Open Problems Although we have proven that the mean-field dimensional family has a dimension dependence of 𝒪( 𝑑), empirical results suggest room for improvement (Kim et al., 2023). Therefore, we pose the following conjecture:

Conjecture 1. Under mild assumptions, BBVI for the mean-field variational family converges with only logarithmic dimensional dependence or no explicit dimensional dependence at all.

This would put mean-field BBVI in a regime clearly faster than approximate MCMC (Freund et al., 2022). Also, it is unknown whether the 𝒪(𝜅2) condition number dependence dependence is tight. In fact, for proximal SGD BBVI in Bures-Wasserstien space, Diao et al. (2023) report a dependence of 𝒪(𝜅). Lastly, it would be interesting to see whether natural gradient VI (NGVI Amari, 1998 Khan & Lin, 2017) can achieve similar convergence guarantees. While it is empirically known that NGVI often converges faster (Lin et al., 2019), theoretical evidence has yet to follow.

Acknowledgments and Disclosure of Funding

The authors would like to thank Justin Domke for discussions on the concurrent results, Javier Burroni for pointing out a mistake in the earlier version of this work, and the anonymous reviewers for their constructive comments.

K. Kim and J. R. Gardner were funded by the National Science Foundation Award [IIS-2145644], while Y.-A. Ma was funded by the National Science Foundation Grants [NSF-SCALE Mo DL2134209] and [NSF-CCF-2112665 (TILOS)], the U.S. Department Of Energy, Office of Science, and the Facebook Research award.

Agrawal, Abhinav, & Domke, Justin. 2021. Amortized Variational Inference for Simple Hierarchical Models. Pages 21388 21399 of: Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc. (page 2) Agrawal, Abhinav, Sheldon, Daniel R, & Domke, Justin. 2020. Advances in Black-Box VI: Normalizing Flows, Importance Weighting, and Optimization. Pages 17358 17369 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 1) Ahn, Kwangjun, Yun, Chulhee, & Sra, Suvrit. 2020. SGD with Shuffling: Optimal Rates without Component Convexity and Large Epoch Requirements. Pages 17526 17535 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 10) Alquier, Pierre, & Ridgway, James. 2020. Concentration of Tempered Posteriors and of Their Variational Approximations. The Annals of Statistics, 48(3), 1475 1497. (page 10) Altosaar, Jaan, Ranganath, Rajesh, & Blei, David. 2018. Proximity Variational Inference. Pages 1961 1969 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 84. JMLR. (page 7) Amari, Shun-ichi. 1998. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251 276. (page 10) Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian, & Lamere, Paul. 2011. The Million Song Dataset. In: Proceedings of the International Conference on Music Information. (page 40) Bhatia, Kush, Kuang, Nikki Lijing, Ma, Yi-An, & Wang, Yixin. 2022 (July). Statistical and Computational Trade-Offs in Variational Inference: A Case Study in Inferential Model Selection. ar Xiv Preprint ar Xiv:2207.11208. ar Xiv. (pages 1, 10) Bingham, Eli, Chen, Jonathan P., Jankowiak, Martin, Obermeyer, Fritz, Pradhan, Neeraj, Karaletsos, Theofanis, Singh, Rohit, Szerlip, Paul, Horsfall, Paul, & Goodman, Noah D. 2019. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research, 20(28), 1 6. (page 1) Blei, David M., Kucukelbir, Alp, & Mc Auliffe, Jon D. 2017. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859 877. (pages 2, 4) Bottou, Léon. 1999. On-Line Learning and Stochastic Approximations. Pages 9 42 of: On-Line Learning in Neural Networks, first edn. Cambridge University Press. (page 2) Bottou, Léon. 2009. Curiously Fast Convergence of Some Stochastic Gradient Descent Algorithms. (page 10) Buchholz, Alexander, Wenzel, Florian, & Mandt, Stephan. 2018. Quasi-Monte Carlo Variational Inference. Pages 668 677 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 80. JMLR. (pages 4, 10) Carpenter, Bob, Gelman, Andrew, Hoffman, Matthew D., Lee, Daniel, Goodrich, Ben, Betancourt, Michael, Brubaker, Marcus, Guo, Jiqiang, Li, Peter, & Riddell, Allen. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1). (page 1) Carvalho, Carlos M., Polson, Nicholas G., & Scott, James G. 2009. Handling Sparsity via the Horseshoe. Pages 73 80 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 5. JMLR. (page 41) Carvalho, Carlos M., Polson, Nicholas G., & Scott, James G. 2010. The Horseshoe Estimator for Sparse Signals. Biometrika, 97(2), 465 480. (page 41)

Challis, Edward, & Barber, David. 2013. Gaussian Kullback-Leibler Approximate Inference. Journal of Machine Learning Research, 14(68), 2239 2286. (pages 1, 5, 10)

Christmas, Jacqueline, & Everson, Richard. 2011. Robust Autoregression: Student-T Innovations Using Variational Bayes. IEEE Transactions on Signal Processing, 59(1), 48 57. (page 41)

Dhaka, Akash Kumar, Catalina, Alejandro, Andersen, Michael R, ns Magnusson, Må, Huggins, Jonathan, & Vehtari, Aki. 2020. Robust, Accurate Stochastic Optimization for Variational Inference. Pages 10961 10973 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 1)

Dhaka, Akash Kumar, Catalina, Alejandro, Welandawe, Manushi, Andersen, Michael R., Huggins, Jonathan, & Vehtari, Aki. 2021. Challenges and Opportunities in High Dimensional Variational Inference. Pages 7787 7798 of: Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc. (page 1)

Diao, Michael Ziyang, Balasubramanian, Krishna, Chewi, Sinho, & Salim, Adil. 2023. Forward Backward Gaussian Variational Inference via JKO in the Bures-Wasserstein Space. Pages 7960 7991 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 202. JMLR. (pages 7, 10)

Dieng, Adji Bousso, Tran, Dustin, Ranganath, Rajesh, Paisley, John, & Blei, David. 2017. Variational Inference via 𝜒Upper Bound Minimization. Pages 2729 2738 of: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (page 2)

Dillon, Joshua V., Langmore, Ian, Tran, Dustin, Brevdo, Eugene, Vasudevan, Srinivas, Moore, Dave, Patton, Brian, Alemi, Alex, Hoffman, Matt, & Saurous, Rif A. 2017 (Nov.). Tensor Flow Distributions. ar Xiv Preprint ar Xiv:1711.10604. ar Xiv. (pages 1, 3)

Domke, Justin. 2019. Provable Gradient Variance Guarantees for Black-Box Variational Inference. Pages 329 338 of: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (pages 1, 3, 4, 8, 10, 21, 22, 36)

Domke, Justin. 2020. Provable Smoothness Guarantees for Black-Box Variational Inference. Pages 2587 2596 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 119. JMLR. (pages 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 25, 27, 33)

Domke, Justin, Garrigos, Guillaume, & Gower, Robert. 2023. Provable Convergence Guarantees for Black-Box Variational Inference. In: Advances in Neural Information Processing Systems (to Appear). New Orleans, LA, USA: ar Xiv. (pages 2, 17, 21)

Dua, Dheeru, & Graff, Casey. 2017. UCI Machine Learning Repository. (page 40)

Duchi, John, Hazan, Elad, & Singer, Yoram. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul), 2121 2159. (page 20)

Dugas, Charles, Bengio, Yoshua, Bélisle, François, Nadeau, Claude, & Garcia, René. 2000. Incorporating Second-Order Functional Knowledge for Better Option Pricing. In: Advances in Neural Information Processing Systems, vol. 13. MIT Press. (page 4)

Dwivedi, Raaz, Chen, Yuansi, Wainwright, Martin J., & Yu, Bin. 2019. Log-Concave Sampling: Metropolis-Hastings Algorithms Are Fast. Journal of Machine Learning Research, 20(183), 1 42. (page 2)

Fan, Kai, Wang, Ziteng, Beck, Jeff, Kwok, James, & Heller, Katherine A. 2015. Fast Second Order Stochastic Backpropagation for Variational Inference. Pages 1387 1395 of: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (page 10)

Fjelde, Tor Erlend, Xu, Kai, Tarek, Mohamed, Yalburgi, Sharan, & Ge, Hong. 2020. Bijectors.jl: Flexible Transformations for Probability Distributions. Pages 1 17 of: Proceedings of The Symposium on Advances in Approximate Bayesian Inference. PMLR, vol. 118. JMLR. (page 3)

Freund, Yoav, Ma, Yi-An, & Zhang, Tong. 2022. When Is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint. Journal of Machine Learning Research, 23(214), 1 32. (page 10)

Fujisawa, Masahiro, & Sato, Issei. 2021. Multilevel Monte Carlo Variational Inference. Journal of Machine Learning Research, 22(278), 1 44. (pages 3, 4, 10)

Garrigos, Guillaume, & Gower, Robert M. 2023 (Feb.). Handbook of Convergence Theorems for (Stochastic) Gradient Methods. ar Xiv Preprint ar Xiv:2301.11235. ar Xiv. (pages 2, 4, 21, 37, 38) Ge, Hong, Xu, Kai, & Ghahramani, Zoubin. 2018. Turing: A Language for Flexible Probabilistic Inference. Pages 1682 1690 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 84. JMLR. (pages 1, 10) Gelman, Andrew, & Hill, Jennifer. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge New York: Cambridge University Press. (page 40) Giordano, Ryan, Broderick, Tamara, & Jordan, Michael I. 2018. Covariances, Robustness, and Variational Bayes. Journal of Machine Learning Research, 19(51), 1 49. (page 1) Giordano, Ryan, Ingram, Martin, & Broderick, Tamara. 2023 (Apr.). Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box. ar Xiv Preprint ar Xiv:2304.05527. ar Xiv. (page 41) Goldberger, Ary L., Amaral, Luis A. N., Glass, Leon, Hausdorff, Jeffrey M., Ivanov, Plamen Ch., Mark, Roger G., Mietus, Joseph E., Moody, George B., Peng, Chung-Kang, & Stanley, H. Eugene. 2000. Physio Bank, Physio Toolkit, and Physio Net: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23). (page 41) Gorbunov, Eduard, Hanzely, Filip, & Richtarik, Peter. 2020. A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent. Pages 680 690 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 108. JMLR. (pages 21, 37) Gower, Robert Mansel, Loizou, Nicolas, Qian, Xun, Sailanbayev, Alibek, Shulgin, Egor, & Richtárik, Peter. 2019. SGD: General Analysis and Improved Rates. Pages 5200 5209 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (pages 4, 8, 21, 38) Haochen, Jeff, & Sra, Suvrit. 2019. Random Shuffling Beats SGD after Finite Epochs. Pages 2624 2633 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (page 10) Hernandez-Lobato, Jose, Li, Yingzhen, Rowland, Mark, Bui, Thang, Hernandez-Lobato, Daniel, & Turner, Richard. 2016. Black-Box Alpha Divergence Minimization. Pages 1511 1520 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 48. JMLR. (page 2) Hoffman, Matthew, & Ma, Yian. 2020. Black-Box Variational Inference as a Parametric Approximation to Langevin Dynamics. Pages 4324 4341 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 119. JMLR. (pages 1, 10) Jager, F., Taddei, A., Moody, G. B., Emdin, M., Antolič, G., Dorn, R., Smrdel, A., Marchesi, C., & Mark, R. G. 2003. Long-Term ST Database: A Reference for the Development and Evaluation of Automated Ischaemia Detectors and for the Study of the Dynamics of Myocardial Ischaemia. Medical and Biological Engineering and Computing, 41(2), 172 182. (pages 40, 41) Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., & Saul, Lawrence K. 1999. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2), 183 233. (pages 2, 3) Karimi, Hamed, Nutini, Julie, & Schmidt, Mark. 2016. Linear Convergence of Gradient and Proximal-Gradient Methods under the Polyak-Łojasiewicz Condition. Pages 795 811 of: Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. Cham: Springer International Publishing. (page 7) Kawala, François, Douzal-Chouakria, Ahlame, Gaussier, Eric, & Diemert, Eustache. 2013. Prédictions d activité Dans Les Réseaux Sociaux En Ligne. Page 16 of: Actes de La Conférence Sur Les Modèles et L Analyse Des Réseaux : Approches Mathématiques et Informatique. (page 40) Khaled, Ahmed, & Richtárik, Peter. 2023. Better Theory for SGD in the Nonconvex World. Transactions of Machine Learning Research. (pages 7, 21, 34, 35) Khan, Mohammad, & Lin, Wu. 2017. Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. Pages 878 887 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 54. JMLR. (page 10)

Khan, Mohammad Emtiyaz, Babanezhad, Reza, Lin, Wu, Schmidt, Mark, & Sugiyama, Masashi. 2016. Faster Stochastic Variational Inference Using Proximal-Gradient Methods with General Divergence Functions. Pages 319 328 of: Proceedings of the Conference on Uncertainty in Artificial Intelligence. UAI 16. Arlington, Virginia, USA: AUAI Press. (pages 7, 10)

Khan, Mohammad Emtiyaz E, Baque, Pierre, Fleuret, François, & Fua, Pascal. 2015. Kullback Leibler Proximal Variational Inference. In: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (pages 7, 10)

Kim, Kyurae, Oh, Jisu, Gardner, Jacob, Dieng, Adji Bousso, & Kim, Hongseok. 2022. Markov Chain Score Ascent: A Unifying Framework of Variational Inference with Markovian Gradients. Pages 34802 34816 of: Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc. (page 2)

Kim, Kyurae, Wu, Kaiwen, Oh, Jisu, & Gardner, Jacob R. 2023. Practical and Matching Gradient Variance Bounds for Black-Box Variational Bayesian Inference. Pages 16853 16876 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 202. Honolulu, HI, USA: JMLR. (pages 1, 3, 4, 7, 8, 10, 21, 22, 34, 35, 36)

Kingma, Diederik P., & Ba, Jimmy. 2015. Adam: A Method for Stochastic Optimization. In: Proceedings of the International Conference on Learning Representations. (pages 2, 9, 20)

Kingma, Diederik P., & Welling, Max. 2014 (Apr.). Auto-Encoding Variational Bayes. In: Proceedings of the International Conference on Learning Representations. (page 2)

Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, & Blei, David M. 2017. Automatic Differentiation Variational Inference. Journal of Machine Learning Research, 18(14), 1 45. (pages 1, 3, 7, 21)

Kunstner, Frederik, Chen, Jacques, Lavington, Jonathan Wilder, & Schmidt, Mark. 2023 (Feb.). Noise Is Not the Main Factor behind the Gap between Sgd and Adam on Transformers, but Sign Descent Might Be. In: Proceedings of the International Conference on Learning Representations. (page 9)

Lambert, Marc, Chewi, Sinho, Bach, Francis, Bonnabel, Silvère, & Rigollet, Philippe. 2022. Variational Inference via Wasserstein Gradient Flows. Pages 14434 14447 of: Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc. (page 10)

Leger, Jean-Benoist. 2023 (Jan.). Parametrization Cookbook: A Set of Bijective Parametrizations for Using Machine Learning Methods in Statistical Inference. ar Xiv Preprint ar Xiv:2301.08297. ar Xiv. (page 3)

Lin, Wu, Khan, Mohammad Emtiyaz, & Schmidt, Mark. 2019. Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-Family Approximations. Pages 3992 4002 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (page 10)

Liu, Sifan, & Owen, Art B. 2021. Quasi-Monte Carlo Quasi-Newton in Variational Bayes. Journal of Machine Learning Research, 22(243), 1 23. (pages 4, 10)

Magnusson, Måns, Bürkner, Paul, & Vehtari, Aki. 2022 (Nov.). Posteriordb: A Set of Posteriors for Bayesian Inference and Probabilistic Programming. (page 40)

Mishchenko, Konstantin, Khaled, Ahmed, & Richtarik, Peter. 2020. Random Reshuffling: Simple Analysis with Vast Improvements. Pages 17309 17320 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 10)

Naesseth, Christian, Lindsten, Fredrik, & Blei, David. 2020. Markovian Score Climbing: Variational Inference with KL(p||q). Pages 15499 15510 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 2)

Nagaraj, Dheeraj, Jain, Prateek, & Netrapalli, Praneeth. 2019. SGD without Replacement: Sharper Rates for General Smooth Convex Functions. Pages 4703 4711 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (page 10)

Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. 2009. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM Journal on Optimization, 19(4), 1574 1609. (page 2)

Nguyen, Lam, Nguyen, Phuong Ha, van Dijk, Marten, Richtarik, Peter, Scheinberg, Katya, & Takac, Martin. 2018. SGD and Hogwild! Convergence without the Bounded Gradients Assumption.

Pages 3750 3758 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 80. JMLR. (page 4) Patil, Anand, Huard, David, & Fonnesbeck, Christopher. 2010. Py MC: Bayesian Stochastic Modelling in Python. Journal of Statistical Software, 35(4). (page 1) Ranganath, Rajesh, Gerrish, Sean, & Blei, David. 2014. Black Box Variational Inference. Pages 814 822 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 33. JMLR. (page 1) Reddi, Sashank J., Kale, Satyen, & Kumar, Sanjiv. 2023 (May). On the Convergence of Adam and Beyond. In: Proceedings of the International Conference on Learning Representations. (page 9) Regier, Jeffrey, Jordan, Michael I, & Mc Auliffe, Jon. 2017. Fast Black-Box Variational Inference through Stochastic Trust-Region Optimization. Pages 2399 2408 of: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (page 10) Robbins, Herbert, & Monro, Sutton. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400 407. (page 2) Roeder, Geoffrey, Wu, Yuhuai, & Duvenaud, David K. 2017. Sticking the Landing: Simple, Lower Variance Gradient Estimators for Variational Inference. Pages 6928 6937 of: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (pages 2, 21) Shannon, Paul, Markiel, Andrew, Ozier, Owen, Baliga, Nitin S., Wang, Jonathan T., Ramage, Daniel, Amin, Nada, Schwikowski, Benno, & Ideker, Trey. 2003. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research, 13(11), 2498 2504. (page 40) Titsias, Michalis. 2009. Variational Learning of Inducing Variables in Sparse Gaussian Processes. Pages 567 574 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 5. JMLR. (page 4) Titsias, Michalis, & Lázaro-Gredilla, Miguel. 2014. Doubly Stochastic Variational Bayes for Non Conjugate Inference. Pages 1971 1979 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 32. JMLR. (pages 1, 2, 3, 5, 6, 10, 21) van der Vaart, A. W. 1998. Asymptotic Statistics. First edn. Cambridge University Press. (page 7) Vaswani, Sharan, Bach, Francis, & Schmidt, Mark. 2019. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron. Pages 1195 1204 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 89. JMLR. (page 4) Wang, Yixin, & Blei, David. 2019. Variational Bayes under Model Misspecification. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (page 7) Welandawe, Manushi, Andersen, Michael Riis, Vehtari, Aki, & Huggins, Jonathan H. 2022 (Mar.). Robust, Automated, and Accurate Black-Box Variational Inference. ar Xiv Preprint ar Xiv:2203.15945. ar Xiv. (page 1) Wright, Stephen J., & Recht, Benjamin. 2021. Optimization for Data Analysis. New York: Cambridge University Press. (page 21) Xu, Ming, Quiroz, Matias, Kohn, Robert, & Sisson, Scott A. 2019. Variance Reduction Properties of the Reparameterization Trick. Pages 2711 2720 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 89. JMLR. (page 10) Xu, Zuheng, & Campbell, Trevor. 2022. The Computational Asymptotics of Gaussian Variational Inference and the Laplace Approximation. Statistics and Computing, 32(4), 63. (page 10) Yao, Yuling, Vehtari, Aki, Simpson, Daniel, & Gelman, Andrew. 2018. Yes, but Did It Work?: Evaluating Variational Inference. Pages 5581 5590 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 80. JMLR. (page 1) Yun, Jihun, Lozano, Aurelie C, & Yang, Eunho. 2021. Adaptive Proximal Gradient Methods for Structured Neural Networks. Pages 24365 24378 of: Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc. (pages 2, 9, 20) Zhang, Cheng, Butepage, Judith, Kjellstrom, Hedvig, & Mandt, Stephan. 2019. Advances in Variational Inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 2008 2026. (pages 2, 9)

Zhang, Yushun, Chen, Congliang, Shi, Naichen, Sun, Ruoyu, & Luo, Zhi-Quan. 2022. Adam Can Converge without Any Modification on Update Rules. Pages 28386 28399 of: Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc. (page 9)

On the Convergence of Black-Box Variational Inference Appendix

Table of Contents

1 Introduction 1

2 Background 2 2.1 Black-Box Variational Inference . . . . . . . . . . . . . . . . . . . . . . 2 2.2 Variational Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.3 Scale Parameterizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.4 Problem Structure of Black-Box Variational Inference . . . . . . . . . . . 4

3 The Evidence Lower Bound Under Nonlinear Scale Parameterizations 5 3.1 Technical Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.2 Smoothness of the Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.3 Smoothness of the Energy . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.4 Convexity of the Energy . . . . . . . . . . . . . . . . . . . . . . . . . . 6

4 Convergence Analysis of Black-Box Variational Inference 7 4.1 Black-Box Variational Inference . . . . . . . . . . . . . . . . . . . . . . 7 4.2 Black-Box Variational Inference with Proximal SGD . . . . . . . . . . . 7

5 Experiments 9 5.1 Synthetic Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.2 Realistic Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

6 Discussions 10

A Computational Resources 18

B Nomenclature 18

C Definitions 19

D Prox Gen Adam for Black-Box Variational Inference 20

E Detailed Comparison Against Domke et al. (2023) 21

F Proofs 22 F.1 Auxiliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 F.2 Properties of the Evidence Lower Bound . . . . . . . . . . . . . . . . . . 24 F.2.1 Smoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 F.2.2 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 F.3 Convergence of Black-Box Variational Inference . . . . . . . . . . . . . 34 F.3.1 Vanilla Black-Box Variational Inference . . . . . . . . . . . . . . 34 F.3.2 Proximal Black-Box Variational Inference . . . . . . . . . . . . . 36

G Details of Experimental Setup 40

H Additional Experimental Results 42

A Computational Resources

Table 1: Computational Resources Type Model and Specifications

System Topology 2 nodes with 2 sockets each with 24 logical threads (total 48 threads) Processor 1 Intel Xeon Silver 4310, 2.1 GHz (maximum 3.3 GHz) per socket Cache 1.1 Mi B L1, 30 Mi B L2, and 36 Mi B L3 Memory 250 Gi B RAM Accelerator 1 NVIDIA RTX A5000 per node, 2 GHZ, 24GB RAM

Running the experiments took approximately a week.

B Nomenclature

Symbol Definition Description Section

𝝀 Variational parameters 2.1 𝒛 Parameters of the target model 𝜋 2.1 𝒯𝝀 𝑪(𝝀) 𝒖+ 𝒎 location-scale reparameterization function 2.2 𝙪 Random vector before reparameterization 2.2 𝜑 Base distribution of 𝙪 2.2 𝒎 Location parameter (part of 𝝀) 2.2 𝑪 Scale parameter (part of 𝝀) 2.2 and 2.3 𝜙 Diagonal conditioner 2.3 𝑘𝜑 Kurtosis (non-central 4th moment) of 𝙪 2.3 𝑫𝜙(𝒔) Diagonal of 𝑪using the diagonal conditioner 𝜙 2.3 𝑳 Strictly lower triangular part of 𝑪 2.3 𝒔 Elements forming the diagonal of 𝑪 2.3 ℓ(𝒛) log 𝑝(𝒛, 𝒙) Negative joint likelihood 2.4 𝑓(𝝀) 𝔼𝙯 𝑞𝝀ℓ(𝙯) Energy 2.4 ℎ(𝝀) ℍ(𝑞𝝀) Negative entropy 2.4 𝐹(𝝀) 𝑓(𝝀) + ℎ(𝝀) Negative ELBO 2.1 and 2.4 𝑓(𝝀; 𝒖) ℓ(𝒯𝝀(𝒖)) Negative Log-Likelihood under reparameterization 2.4 𝑔(𝝀; 𝒖) ℓ(𝒯𝝀(𝒖)) Gradient of the Log-likelihood under reparameterization 2.4 𝑀 Number of Monte Carlo samples 2.1 𝛾𝑡 Stepsize of (proximal) SGD at iteration 𝑡 4.1 and 4.2 ˆ 𝑓 Reparameterization gradient estimator of the energy 4.1

C Definitions

For completeness, we provide formal definitions for some of the terms we used throughout the paper.

Definition 5 (Smoothness). A function 𝑓 𝒵 ℝis said to be 𝐿-smooth if the inequality

𝑓(𝒛) 𝑓(𝒛 ) 𝒛 𝒛

holds for all 𝒛, 𝒛 𝒵.

This assumption, also occasionally called Lipschitz smoothness, restricts the amount the gradient can change for a given distance. When 𝑓is twice differentiable, an equivalent condition is the Hessian to be bounded:

Definition 6 (Smoothness). A twice differentiable function 𝑓 𝒵 ℝis said to be 𝐿-smooth if the inequality 2𝑓(𝒛) 𝐿 holds for all 𝒛 𝒵.

Remark 7. Assuming a function 𝑓is smooth is equivalent to assuming that 𝑓can be upper bounded by a quadratic function everywhere.

Remark 8. When the log-density log 𝜋of a probability measure Π is 𝐿-smooth, log 𝜋can be upper bounded everywhere by the log-density of a Gaussian.

Definition 7 (Strong Convexity). A twice differentiable function 𝑓 ℝ𝑑 ℝis said to be 𝜇-strongly convex if the inequality 𝜇 2 𝒛 𝒛 2 + 𝑓(𝒛) , 𝒛 𝒛 + 𝑓(𝒛) 𝑓(𝒛 )

holds for all 𝒛, 𝒛 ℝ𝑑and some 𝜇> 0.

Remark 9. If Definition 7 holds only for 𝜇= 0, 𝑓is said to be (non-strongly) convex.

Remark 10. Assuming a function 𝑓is strongly convex is equivalent to assuming that 𝑓can be lower bounded by a quadratic.

Definition 8 (Strongly Log-Concave Measures). For a probability measure Π in a Euclidean measurable space (ℝ𝑑, ℬ(ℝ𝑑) , ℙ), where ℬ(ℝ𝑑) is the 𝜎-algebra of Borel-measurable subsets of ℝ𝑑, ℙis the Lebesgue measure, we say Π is 𝜇-strongly log-concave if its log-density log 𝜋(𝒛) ℝ𝑑 ℝis 𝜇-strongly convex for some 𝜇> 0.

Remark 11. If Definition 8 holds only for 𝜇= 0, Π is said to be (non-strongly) log-concave.

Remark 12. When Π is 𝜇-strongly log-concave, log 𝜋can be lower bounded everywhere by the log-density of a Gaussian.

D Prox Gen Adam for Black-Box Variational Inference

Algorithm 1: Prox Gen-Adam for Black-Box Variational Inference Input: Initial variational parameters 𝝀0, base stepsize 𝛼, second moment stepsize 𝛽2, momentum stepsize {𝛽1,𝑡}𝑇 𝑡=1, small positve constant 𝜖 for 𝑡= 1, , 𝑇do

estimate gradient of energy ˆ 𝑓 𝒈𝑡= ˆ 𝑓(𝝀) + ℎ(𝝀) 𝝀𝑡+1 = 𝛽1,𝑡𝝀𝑡+ (1 𝛽1,𝑡) 𝝀𝑡 𝒗𝑡+1 = 𝛽2𝒗𝑡+ (1 𝛽2) 𝒈2 𝑡 𝜞𝑡+1 = diag (𝛼/ ( 𝒗𝑡+1 + 𝜖))

𝝀𝑡+1 = 𝝀𝑡 𝜞𝑡+1𝝀𝑡+1 𝒔𝑡+1 getscale (𝝀𝑡+1)

𝒔𝑡+1 𝒔𝑡+1 + 1

2 ( 𝒔2 𝑡+1 + 4𝜸𝒔,𝑡+1 𝒔𝑡+1)

𝝀𝑡+1 setscale (𝝀𝑡+1, 𝒔𝑡+1) end

(By convention, all vector operations are elementwise.)

Adaptive and matrix-valued stepsize-variants of SGD such as Adam (Kingma & Ba, 2015), Ada Grad (Duchi et al., 2011) are widely used. The matrix stepsize of Adam at iteration 𝑡is given as

𝜞𝑡+1 = diag (𝛼/ ( 𝒗𝑡+1 + 𝜖)) ,

where 𝒗𝑡is the exponential moving average of the second moment, 𝛼is the base stepsize. Furthermore, the matrix stepsize is applied to the moving average of the gradients, a scheme often called the (heavy-ball) momentum, denoted here as 𝝀𝑡.

Recently, Yun et al. (2021) have proven the convergence for these adaptive, momentum, and matrixvalued stepsize-based SGD methods with proximal steps. Then, the proximal operator is applied as

prox𝜞𝑡,ℎ(𝝀𝑡 𝜞𝑡𝝀𝑡) = arg min 𝝀 { 𝝀𝑡, 𝝀 + ℎ(𝝀) + 1

2(𝝀 𝝀𝑡) 𝜞 1 𝑡 (𝝀 𝝀𝑡) } .

For Adam, the matrix-valued stepsize is a diagonal matrix. Thus, the proximal operator of Domke (2020) for each 𝑠𝑖forms independent 1-dimensional quadratic problems. Thus, the proximal step is given in the closed-form

prox𝜞𝑡,ℎ(𝑠𝑖) = 𝑠𝑖+ 1

2 ( 𝑠2 𝑖+ 4𝛾𝑠𝑖 𝑠𝑖) ,

where, dropping the index 𝑡for clarity, 𝑠𝑖is the element of 𝝀𝑡corresponding to 𝑠𝑖, 𝛾𝑠𝑖denotes the stepsize of 𝑠𝑖(a diagonal element of 𝜞𝑡). Combined with the Adam-like stepsize rule, the algorithm is shown in Algorithm 1.

Difference with Adam In Algorithm 1, we can see the differences with vanilla Adam. Notably, Prox Gen Adam does not perform bias correction of the estimated moments. Furthermore, while some implementations of Adam decay 𝛽1, we keep it constant. It is possible that these differences could result in a different behavior from vanilla Adam. However, in this work, we follow the original implementation by Yun et al. (2021) as closely as possible and leave the comparison with vanilla Adam to future works.

E Detailed Comparison Against Domke et al. (2023)

In this section, we contrast our results against those of Domke et al. (2023). First, the main challenge to establishing a convergence guarantee for BBVI has been on bounding the gradient variance. In particular, Domke (2019) proved that the variance of the reparameterization gradient for the energy, ˆ 𝑓, is bounded as

2 𝛼 𝝀 𝝀 2 2 + 𝛽 (3)

for some finite positive constants 𝛼, 𝛽depending on the problem constants 𝑑, 𝐿, 𝑘𝜑. Domke et al. (2023) call a gradient estimator satisfying this bound to be a quadratic variance estimator. Furthermore, they prove that the closed-form entropy (CFE Kucukelbir et al., 2017 Titsias & Lázaro Gredilla, 2014) estimator:

ˆ 𝐹CFE (𝝀) ˆ 𝑓(𝝀) + ℎ(𝝀) and the STL estimator by Roeder et al. (2017):

ˆ 𝐹STL (𝝀) 1

𝑚=1 𝝀ℓ(𝒯𝝀(𝙪𝑚)) + 𝝀log 𝑞𝝀(𝒯𝝂(𝙪𝑚)) ||𝝂=𝝀,

where 𝙪 𝜑, also qualify as quadratic variance estimators.

Unfortunately, it has been unknown whether SGD is guaranteed to converge with a quadratic variance estimator except for strongly convex objectives (Wright & Recht, 2021, p. 85). Domke et al. (2023) expand the boundaries of SGD and prove that projected and proximal SGD with a quadratic variance estimator converges for both convex and strongly convex objectives. In particular, for the location-scale variational family, the linear parameterization, and log-concave objectives, they prove a complexity of 𝒪(1/𝜖2), and for strongly log-concave objectives, they prove a complexity of 𝒪(1/𝜖).

On the other hand, Kim et al. (2023) and Lemma 12 developed the bound in Equation (3) to be of the form of

2 𝐴(𝐹(𝝀) 𝐹 ) + 𝐹(𝝀) 2 2 + 𝐶 (4)

for some positive finite constants 𝐴, 𝐵, 𝐶, for which the convergence of SGD for convex, strongly convex (Garrigos & Gower, 2023 Gorbunov et al., 2020), and non-convex objectives (Khaled & Richtárik, 2023) have already been proven. Applying these results to log-smooth and logquadratically growing objectives, we prove a complexity of 𝒪(1/𝜖4), while for strong log-concave objectives, we also prove a complexity of 𝒪(1/𝜖).

Overall, both approaches can be summarized as follows: we focused on establishing gradient variance bounds of known convergence proofs, while Domke et al. (2023) aimed to prove that the bound by Domke (2019) is sufficient to guarantee convergence. Note that, for strongly log-concave objectives, Equation (3) immediately implies Equation (4). Therefore, both approaches intersect in the case of strongly log-concave objectives. Indeed, Theorem 8 and the analogous result of Domke et al. (2023) are both based on the same proof strategy by Gower et al. (2019).

F.1 Auxiliary Lemmas

Lemma 5. Let 𝜙(𝑥) = 𝑥. Then, the parameterization is linear in the sense that 𝒯𝝀is a bilinear function such that 𝒯𝝀 𝝀 (𝒖) = 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) . for any 𝝀, 𝝀 Λ.

𝒯𝝀 𝝀 (𝒖) = (𝑪(𝝀 𝝀 )) 𝒖+ (𝒎 𝒎 )

= (𝑫𝜙(𝒔 𝒔 ) + (𝑳 𝑳 )) 𝒖+ (𝒎 𝒎 ) , using the fact that 𝜙is the identity function, = (𝑫𝜙(𝒔) 𝑫𝜙(𝒔 ) + (𝑳 𝑳 )) 𝒖+ (𝒎 𝒎 )

= (𝑪(𝝀) 𝑪(𝝀 )) 𝒖+ (𝒎+ 𝒎 ) = (𝑪(𝝀) 𝒖+ 𝒎) (𝑪(𝝀 ) 𝒖+ 𝒎 ) = 𝒯𝝀(𝒖) 𝒯𝝀 (𝒖) . The linearity with respect to 𝒖is obvious.

Lemma 6. Let the linear parameterization be used. Then, for any 𝝀, 𝝀 Λ, the inner product of the Jacobian of the reparameterization function satisfies the following equalities for any 𝒖 ℝ𝑑.

(i) For the Cholesky family (Domke, 2019, Lemma 8),

𝝀 = (1 + 𝒖 2 2) 𝐈

(ii) For the mean-field family (Kim et al., 2023, Lemma 1),

𝝀 = (1 + 𝑼2 F) 𝐈,

where 𝑼= diag (𝑢1, , 𝑢𝑑).

Lemma 7. Let the linear parameterization be used. Then, for any 𝝀 Λ and any 𝒛 ℝ𝑑, the following relationships hold.

(i) For the Cholesky family (Domke, 2019, Lemma 2),

𝔼(1 + 𝙪 2 2) 𝒯𝝀(𝙪) 𝒛 2 2 = (𝑑+ 1) 𝒎 𝒛 2 2 + (𝑑+ 𝑘𝜑) 𝑪 2 F

(ii) For the mean-field family (Kim et al., 2023, Lemma 2),

𝔼(1 + 𝙐2 F) 𝒯𝝀(𝙪) 𝒛 2 2 ( 𝑑𝑘𝜑+ 𝑘𝜑 𝑑+ 1) 𝒎 𝒛 2 2 + (2𝑘𝜑 𝑑+ 1) 𝑪 2 F.

Corollary 2. Let the linear parameterization be used and 𝝀, 𝝀 Λ be any pair of variational parameters.

(i) For the Cholesky family,

𝔼(1 + 𝙪 2 2) 𝒯𝝀 (𝙪) 𝒯𝝀(𝙪) 2 2 (𝑘𝜑+ 𝑑) 𝝀 𝝀 2 2

(ii) For the mean-field family,

𝔼(1 + 𝙐2 F) 𝒯𝝀 (𝙪) 𝒯𝝀(𝙪) 2 2 (2𝑘𝜑 𝑑+ 1) 𝝀 𝝀 2 2

Proof. The results are a direct consequence of Lemma 7 and Lemma 5.

Proof of (i) We start from Lemma 7 as

𝔼(1 + 𝙪 2 2) 𝒯𝝀 𝝀 (𝙪) 𝒛 2 2 = (𝑑+ 1) (𝒎 𝒎 ) 𝒛 2 2 + (𝑑+ 𝑘𝜑) 𝑪(𝝀) 𝑪(𝝀 ) 2 F,

setting 𝒛= 𝟎,

= (𝑑+ 1) 𝒎 𝒎 2 2 + (𝑑+ 𝑘𝜑) 𝑪(𝝀) 𝑪(𝝀 ) 2 F, and since 𝑘𝜑 3 by the property of the kurtosis,

(𝑑+ 𝑘𝜑) ( 𝒎 𝒎 2 2 + 𝑪(𝝀) 𝑪(𝝀 ) 2 F)

= (𝑑+ 𝑘𝜑) 𝝀 𝝀 2 2.

Proof of (ii) Similarly, for the mean-field family, we can apply Lemma 7 as

𝔼(1 + 𝙐2 F) 𝒯𝝀 𝝀 (𝙪) 𝒛 2 2 ( 𝑑𝑘𝜑+ 𝑘𝜑 𝑑+ 1) (𝒎 𝒎 ) 𝒛 2 2 + (2𝑘𝜑 𝑑+ 1) 𝑪 𝑪 2 F,

setting 𝒛= 𝟎,

= ( 𝑑𝑘𝜑+ 𝑘𝜑 𝑑+ 1) 𝒎 𝒎 2 2 + (2𝑘𝜑 𝑑+ 1) 𝑪 𝑪 2 F,

and since 𝑘𝜑 3 by the property of the kurtosis,

(2𝑘𝜑 𝑑+ 1) ( 𝒎 𝒎 2 2 + 𝑪(𝝀) 𝑪(𝝀 ) 2 F)

= (2𝑘𝜑 𝑑+ 1) 𝝀 𝝀 2 2.

Lemma 8. For the linear parameterization,

𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2 = 𝝀 𝝀 2 2 for any 𝝀, 𝝀 Λ.

Proof. First notice that, for linear parameterizations, we have

𝔼 𝒯𝝀(𝙪) 2 2 = 𝔼 𝑪𝙪+ 𝒎 2 2 = 𝔼𝙪 𝑪 𝑪𝙪+ 𝒎 2 2 + 2𝒎 𝑪𝔼𝙪

= 𝔼tr (𝙪 𝑪 𝑪𝙪) + 𝒎 2 2 + 2𝒎 𝑪𝔼𝙪, rotating the elements of the trace,

= tr (𝑪 𝑪𝔼𝙪𝙪 ) + 𝒎 2 2 + 2𝒎 𝑪𝔼𝙪, applying Assumption 1

= tr (𝑪 𝑪) + 𝒎 2 2 = 𝑪 2 F + 𝒎 2 2 = 𝝀 2 2.

Combined with Lemma 5, we have

𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2 = 𝔼 𝒯𝝀 𝝀 (𝙪) 2 2 = 𝝀 𝝀 2 2.

F.2 Properties of the Evidence Lower Bound

F.2.1 Smoothness

Lemma 1. If the diagonal conditioner 𝜙is 𝐿ℎ-log-smooth, then the entropic regularizer ℎ(𝝀) is 𝐿ℎ-smooth.

Proof. The entropic regularizer is

ℎ(𝝀) = H (𝜑)

𝑖=1 log 𝜙(𝑠𝑖) ,

and depends only on the diagonal elements 𝑠1, , 𝑠𝑑of 𝑪. The Hessian of ℎis then a diagonal matrix, where only the entries that correspond to 𝑠1, , 𝑠𝑑are non-zero. The Lipschitz smoothness constant is then the constant 𝐿ℎ< that satisfies

𝑠2 𝑖 = d2 log 𝜙

for all 𝑖= 1, , 𝑑, which is the smoothness constant of 𝑠𝑖 log 𝜙(𝑠𝑖).

Lemma 2. Let 𝙃be a 𝑛 𝑛symmetric random matrix, where it is bounded as 𝙃 2 𝐿< almost surely. Also, let 𝙅be an 𝑚 𝑛random matrix such that 𝔼𝙅 𝙅 2 < . Then,

𝔼𝙅 𝙃𝙅 2 𝐿 𝔼𝙅 𝙅 2.

Proof. By the property of the Rayleigh quotients, for a symmetric matrix 𝑨, its maximum eigenvalue is given in the variational form

sup 𝒙 1 𝒙 𝑯𝒙= 𝜎max (𝑯) 𝜎max (𝑯) 2 = 𝑯 2,

where 𝜎max (𝑨) is the maximal eigenvalue of 𝑨. Notice the relationship with the ℓ2-operator norm. The inequality is strict only if all eigenvalues are negative.

From the property above,

𝔼𝙅 𝙃𝙅 2 = sup 𝒙 2 1 𝒙 (𝔼𝙅 𝙃𝙅) 𝒙.

By reparameterizing as 𝙮= 𝙅𝒙, = sup 𝒙 2 1 𝔼𝙮 𝙃𝙮,

and the property of the ℓ2-operator norm,

sup 𝒙 2 1 𝔼 𝙃 2 𝙮 2 2 = sup 𝒙 2 1 𝔼 𝙃 2 𝙅𝒙 2 2.

From our assumption about the maximal eigenvalue of 𝑯,

𝐿sup 𝒙 2 1 𝔼 𝙅𝒙 2 2,

denoting the ℓ2 vector norm as a quadratic form as, = 𝐿sup 𝒙 2 1 𝒙 (𝔼𝙅 𝙅) 𝒙,

again, by the property of the ℓ2-operator norm,

𝐿 𝔼𝙅 𝙅 2 sup 𝒙 2 1 𝒙 2 2

= 𝐿 𝔼𝙅 𝙅 2.

Lemma 9. For a 1-Lipschitz diagonal conditioner 𝜙, the Jacobian of the location-scale reparameterization function 𝒯𝝀satisfies

Proof. For notational clarity, we will occasionally represent 𝒯𝝀as

𝒯𝝀(𝙪) = 𝒯(𝝀; 𝙪) ,

such that 𝒯𝑖(𝝀; 𝙪) denotes the 𝑖th component of 𝒯𝝀.

From the definition of 𝒯𝝀, it is straightforward to notice that its Jacobian is the concatenation of 3 block matrices 𝑱𝒎= 𝒯𝝀(𝒖)

𝒎 , 𝑱𝒔= 𝒯𝝀(𝒖)

𝒔 , and 𝑱𝑳= 𝒯𝝀(𝒖)

The 𝒎block form a deterministic identity matrix

which is shown by (Domke, 2020, Lemma 4).

The proof strategy is as follows: we will directly compute the squared Jacobian through block matrix multiplication. The key is that, after expectation, the resulting matrix becomes diagonal. Then, the ℓ2 operator norm, or maximal eigenvalue, follows trivially as the maximal diagonal element.

𝝀 = 𝔼[ 𝑱 𝒎 𝙅 𝒔 𝙅 𝑳 ] [𝑱𝒎 𝙅𝒔 𝙅𝑳]

= 𝔼[ 𝑱 𝒎𝑱𝒎 𝑱 𝒎𝙅𝒔 𝑱 𝒎𝙅𝑳 𝙅 𝒔𝑱𝒎 𝙅 𝒔𝙅𝒔 𝙅 𝒔𝙅𝑳 𝙅 𝑳𝙅𝒎 𝙅 𝑳𝙅𝒔 𝙅 𝑳𝙅𝑳 ]

= [ 𝐈 𝔼𝙅𝒔 𝔼𝙅𝑳 𝔼𝙅 𝒔 𝔼𝙅 𝒔𝙅𝒔 𝔼𝙅 𝒔𝙅𝑳 𝔼𝙅 𝑳 𝔼𝙅 𝑳𝙅𝒔 𝔼𝙅 𝑳𝙅𝑳 ] .

For 𝙅𝒔, the entries are 𝒯𝑖(𝙪)

𝑠𝑗 = 𝜙 (𝑠𝑖) 𝘶𝑗𝟙𝑖=𝑗,

which is a diagonal matrix. Thus, by Assumption 1,

𝔼𝙅𝒔= 𝐎, 𝔼𝙅 𝒔𝙅𝒔= diag (𝝓 (𝒔)) 2.

For 𝑱𝒔, the entries are 𝒯𝑖(𝝀; 𝙪)

𝐿𝑗𝑘 = 𝑢𝑘𝟙𝑖=𝑗.

To gather some intuition, the case of 𝑑= 4 looks like the following:

𝘶2 𝘶3 𝘶4 𝘶1 𝘶3 𝘶4 𝘶1 𝘶2 𝘶4 𝘶1 𝘶2 𝘶3

It is crucial to notice that the 𝑖th row does not include 𝘶𝑖. This means that, the matrix 𝙅 𝒔𝙅𝑳has entries that are either 0, or 𝜙 (𝑠𝑖) 𝘶𝑖𝘶𝑗for 𝑖 𝑗, which is 𝔼𝜙 (𝑠𝑖) 𝘶𝑖𝘶𝑗= 0 by Assumption 1. Therefore,

Finally, the elements of 𝙅 𝑳𝙅𝑳are

𝑖=0 𝘶𝑘𝘶𝑚𝟙𝑖=𝑗𝟙𝑖=𝑙= 𝟙𝑗=𝑙(𝔼𝘶𝑘𝘶𝑚) = 𝟙𝑗=𝑙𝟙𝑘=𝑚,

where the last equality follows from Assumption 1, which forms an identity matrix as

Therefore, the expected-squared Jacobian is now

𝝀 = [ 𝐈 𝔼𝙅𝒔 𝔼𝙅𝑳 𝔼𝙅 𝒔 𝔼𝙅 𝒔𝙅𝒔 𝔼𝙅 𝒔𝙅𝑳 𝔼𝙅 𝑳 𝔼𝙅 𝑳𝙅𝒔 𝔼𝙅 𝑳𝙅𝑳 ]

= [ 𝐈 diag (𝝓(𝒔)) 2

which, conveniently, is a diagonal matrix. The maximal singular value of a block-diagonal matrix is the maximal singular value of each block. And since each block is diagonal with only positive entries, the largest element forms the maximal singular value. As we assume that 𝜙is 1-Lipchitz, the element of all blocks is lower-bounded by 0 and upper-bounded by 1. Therefore, the maximal singular value of the expected-squared Jacobian is bounded by 1.

Theorem 1. Let ℓbe 𝐿ℓ-smooth and twice differentiable. Then, the following results hold:

(i) If 𝜙is linear, the energy 𝑓is 𝐿ℓ-smooth. (ii) If 𝜙is 1-Lipschitz, the energy ℓis (𝐿ℓ+ 𝐿𝑠)-smooth if and only if Assumption 3 holds.

Proof. For notational clarity, we will occasionally represent 𝒯𝝀as

𝒯𝝀(𝙪) = 𝒯(𝝀; 𝙪) ,

such that 𝒯𝑖(𝝀; 𝙪) denotes the 𝑖th component of 𝒯𝝀.

By the Leibniz and chain rule, the Hessian of the energy 𝑓follows as

2𝑓(𝝀) = 𝔼 2 𝝀ℓ(𝒯𝝀(𝙪))

2ℓ(𝒯𝝀(𝙪)) 𝒯𝝀(𝙪)

𝑖=1 D𝑖ℓ(𝒯𝝀(𝙪)) 2𝒯𝑖(𝝀; 𝙪)

When 𝒯is linear with respect to 𝝀, it is clear that we have

𝝀2 = 𝟎. (5)

Then, 𝑇non is zero. In contrast, 𝑇lin appears for both the linear and nonlinear cases. Therefore, 𝑇non fully characterizes the effect of nonlinearity in the reparameterization function.

Now, the triangle inequality yields

2𝑓(𝝀) 2 = 𝑇lin + 𝑇non 2 𝑇lin 2 + 𝑇non 2,

where equality is achieved when either term is 0. On the contrary, the reverse triangle inequality states that

|| 𝑇lin 2 𝑇non 2|| 2𝑓(𝝀) 2.

This implies that, if either 𝑇lin or 𝑇non is unbounded, the Hessian is not bounded. Thus, ensuring that 𝑇lin and 𝑇non are bounded is sufficient and necessary to establish that 𝑓is smooth.

Proof of (i) The bound on the linear part, 𝑇lin, follows from Lemma 2 as

𝑇lin 2 = 𝔼( 𝒯𝝀(𝙪)

2ℓ(𝒯𝝀(𝙪)) 𝒯𝝀(𝙪)

𝐿ℓ 𝔼( 𝒯𝝀(𝙪)

and from the 1-Lipschitzness of 𝜙, Lemma 9 yields

When 𝜙is linear, it immediately follows from Equation (5) that

2𝑓(𝝀) 2 = 𝑇lin 2 𝐿ℓ,

which is tight as shown by Domke (2020, Theorem 6).

Proof of (ii) For the nonlinear part 𝑇non, we use the fact that 𝒯𝑖(𝝀; 𝙪) is given as

𝒯𝑖(𝝀; 𝙪) = 𝑚𝑖+ 𝜙(𝑠𝑖) 𝘶𝑖+

The second derivative of 𝒯𝑖is clearly non-zero only for the nonlinear part involving 𝑠1, , 𝑠𝑑. Thus, 𝑇non follows as

𝑖=1 D𝑖ℓ(𝒯𝝀(𝙪)) 2𝒯𝑖(𝝀; 𝙪)

𝑖=1 𝑔𝑖(𝝀; 𝙪) 2𝒯𝑖(𝝀; 𝙪)

𝔼 𝑑 𝑖=1 𝑔𝑖(𝝀; 𝙪) 2𝒯𝑖(𝝀;𝙪)

Furthermore, the second-order derivatives with respect to 𝑠1, , 𝑠𝑑are given as

𝑠2 𝑗 = 𝟙𝑖=𝑗𝜙 (𝑠𝑗) .

Considering this, the only non-zero block of 𝑇non forms a diagonal matrix as

𝑖=1 𝑔𝑖(𝝀; 𝙪) 2𝒯𝑖(𝝀; 𝙪)

𝔼𝑔1 (𝝀; 𝙪) 2𝒯1(𝝀;𝙪)

𝔼𝑔𝑑(𝝀; 𝙪) 2𝒯𝑑(𝝀;𝙪)

= [ 𝔼𝑔1 (𝝀; 𝙪) 𝜙 (𝑠1) 𝘶1 𝔼𝑔𝑑(𝝀; 𝙪) 𝜙 (𝑠𝑑) 𝘶𝑑 ]

This implies that the only non-zero entries of 𝑇non lie on its diagonal. Since the ℓ2 norm of a diagonal matrix is the value of the maximal diagonal element,

𝑇non 2 max 𝑖=1, ,𝑑𝔼𝑔𝑖(𝝀; 𝙪) 𝜙 (𝑠𝑖) 𝘶𝑖 𝐿𝑠,

where 𝐿𝑠is finite constant if Assumption 3 holds. On the contrary, if a finite 𝐿𝑠does not exist, 𝑇non 2 cannot be bounded. Therefore, the energy is smooth if and only if Assumption 3 holds. When it does, the energy 𝑓is 𝐿𝑓+ 𝐿𝑠smooth.

Example 2. Let ℓ(𝒛) = (1/2) 𝒛 𝑨𝒛and the diagonal conditioner be 𝜙(𝑥) = softplus (𝑥). Then,

(i) if 𝑨is dense and the variational family is the mean-field family or (ii) if 𝑨is diagonal and the variational family is the Cholesky family,

Assumption 3 holds with 𝐿𝑠 0.26034 (max𝑖=1, ,𝑑𝐴𝑖𝑖).

(iii) If 𝑨is dense but the Cholesky family is used, Assumption 3 does not hold.

Proof. Since the gradient is ℓ(𝒛) = 𝑨𝒛, combined with reparameterization, we have

𝑔(𝝀; 𝙪) = 𝑨(𝑪𝙪+ 𝒎)

Then, for each coordinate 𝑖= 1, , 𝑑, we have

𝔼𝑔𝑖(𝝀; 𝙪) 𝘶𝑖𝜙 (𝑠𝑖) = 𝔼(

𝑘 𝑗 𝐴𝑖𝑗𝐶𝑗𝑘𝘶𝑘+

𝑗 𝐴𝑖𝑗𝑚𝑗) 𝘶𝑖𝜙 (𝑠𝑖)

𝑘 𝑗 𝐴𝑖𝑗𝐶𝑗𝑘𝔼𝘶𝑘𝘶𝑖𝜙 (𝑠𝑖) +

𝑗 𝐴𝑖𝑗𝑚𝑗𝔼𝘶𝑖𝜙 (𝑠𝑖),

and from Assumption 1,

𝑘 𝑗 𝐴𝑖𝑗𝐶𝑗𝑘𝟙𝑘=𝑖

Furthermore, the diagonal of 𝑪involves 𝜙such that

𝔼𝑔𝑖(𝝀; 𝙪) 𝘶𝑖𝜙 (𝑠𝑖) = 𝐴𝑖𝑖𝜙(𝑠𝑖) 𝜙 (𝑠𝑖)

𝑗<𝑖 𝐴𝑖𝑗𝐶𝑗𝑖𝜙 (𝑠𝑖)

For the softplus function, we have 0 < 𝜙 (𝑠) < 1 for any finite 𝑠, and we have sup 𝑠 𝜙(𝑠) 𝜙 (𝑠) 0.26034,

where the supremum was numerically approximated. Then, it is clear that 𝑇diag is finite as long as the diagonals of 𝑨are finite. Furthermore, we have the following:

(i) If 𝑨is diagonal, then 𝑇off is 0.

(ii) If 𝑨is dense but 𝑪is diagonal due to the use of the mean-field family, 𝑇off is again 0.

(iii) However, when both 𝑨and 𝑪are not diagonal, 𝑇off can be made arbitrarily large.

F.2.2 Convexity

Lemma 10. Let ℓbe convex. Then, for a convex nonlinear 𝜙, the inequality

𝝀𝑓(𝝀) , 𝝀 𝝀 𝔼 𝑔(𝝀; 𝙪) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪)

holds for all 𝝀 Λ if and only if Assumption 4 holds. For the linear parameterization, the inequality becomes equality.

Proof. First, notice that the left-hand side is

𝝀𝑓(𝝀) , 𝝀 𝝀 =

𝑖=1 𝝀𝔼ℓ(𝒯𝝀(𝙪)) , 𝝀 𝝀 = 𝔼

𝑖=1 ( 𝒯𝝀(𝙪)

𝝀𝑖 )𝑔(𝝀; 𝙪) , 𝝀 𝝀 .

By restricting us to the location-scale family, we then get

𝑚𝑖 ) 𝑔(𝝀; 𝙪) (𝑚𝑖 𝑚 𝑖)

convexity with respect to 𝒎

𝐿𝑖𝑗 ) 𝑔(𝝀; 𝙪) (𝐿𝑖𝑗 𝐿 𝑖𝑗)

convexity with respect to 𝑳

𝑠𝑖 ) 𝑔(𝝀; 𝙪) (𝑠𝑖 𝑠 𝑖)

convexity with respect to 𝒔

and plugging the derivatives of the reparameterization function,

𝑖 𝑔𝑖(𝝀; 𝙪) (𝑚𝑖 𝑚 𝑖) +

𝑗<𝑖 𝘶𝑗𝑔𝑖(𝝀; 𝙪) (𝐿𝑖𝑗 𝐿 𝑖𝑗) +

𝑖 𝜙 (𝑠𝑖) 𝘶𝑖𝑔𝑖(𝝀; 𝙪) (𝑠𝑖 𝑠 𝑖) ).

On the other hand, the right-hand side follows as

𝔼 𝑔(𝝀; 𝙪) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪)

= 𝔼( 𝑔(𝝀; 𝙪) , 𝒎 𝒎 + 𝑔(𝝀; 𝙪) , (𝑳 𝑳 ) 𝙪 + 𝑔(𝝀; 𝙪) , (𝜱(𝒔) 𝜱(𝒔 )) 𝙪 )

𝑖 𝑔𝑖(𝝀; 𝙪) (𝑚𝑖 𝑚 𝑖) +

𝑗<𝑖 𝘶𝑗𝑔𝑖(𝝀; 𝙪) (𝐿𝑖𝑗 𝐿 𝑖𝑗) +

𝑖 𝑔𝑖(𝝀; 𝙪) 𝘶𝑖(𝜙(𝑠𝑖) 𝜙(𝑠 𝑖)) ).

The convexity with respect to the 𝒎and 𝑳is clear from the first two terms they are equal. The statement is now up to the last term. That is, the statement holds if

𝑖 𝑔𝑖(𝝀; 𝙪) 𝘶𝑖𝜙 (𝑠𝑖) (𝑠𝑖 𝑠 𝑖) 𝔼

𝑖 𝑔𝑖(𝝀; 𝙪) 𝘶𝑖(𝜙(𝑠𝑖) 𝜙(𝑠 𝑖)) . (6)

For this, we will show that Assumption 4 is both necessary and sufficient.

Proof of sufficiency Equation (6) holds if

𝔼𝑔𝑖(𝝀; 𝙪) 𝘶𝑖 0

for all 𝑖= 1, , 𝑑, which is non other than Assumption 4.

Proof of necessity Suppose that the inequality

𝝀𝑓(𝝀) , 𝝀 𝝀 𝔼 𝑔(𝝀; 𝙪) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪)

holds for all 𝝀 Λ, implying

𝑖 𝔼𝘶𝑖𝑔𝑖(𝝀; 𝙪) 𝜙 (𝑠𝑖)(𝑠𝑖 𝑠 𝑖)

𝑖 𝔼𝘶𝑖𝑔𝑖(𝝀; 𝙪) ((𝜙(𝑠𝑖) 𝜙(𝑠 𝑖)).

For any 𝝀, we are free to set any 𝝀 Λ and check whether we can retrieve Assumption 4 for this specific 𝝀. Now, for each axis 𝑖, set 𝑠 𝑗= 𝑠𝑗for all 𝑗 𝑖, then

𝔼𝘶𝑖𝑔𝑖(𝝀; 𝙪) 𝜙 (𝑠𝑖)(𝑠𝑖 𝑠 𝑖) 𝔼𝘶𝑖𝑔𝑖(𝝀; 𝙪) (𝜙(𝑠𝑖) 𝜙(𝑠 𝑖)) .

Since 𝜙is assumed to be convex such that

𝜙 (𝑠𝑖) (𝑠𝑖 𝑠 𝑖) 𝜙(𝑠𝑖) 𝜙(𝑠 𝑖) .

it follows that

𝔼𝘶𝑖𝑔𝑖(𝝀; 𝙪) 0. (7)

Therefore, for any 𝝀 Λ it must be that Assumption 4 holds.

Proposition 1. We have the following:

(i) If ℓis convex, then for the mean-field family, Assumption 4 holds. (ii) For the Cholesky family, there exists a convex ℓwhere Assumption 4 does not hold.

Proof. For (i), the key property is the monotonicity of the gradient.

Proof of (i) For the mean-field family, recall that

𝐶𝑖𝑖= 𝜙(𝑠𝑖) .

Also, observe that 𝑪𝙪+ 𝒎= (𝐶11𝘶1 + 𝑚1, , 𝐶𝑑𝑑𝘶𝑑+ 𝑚𝑑). By the property of convex functions, ℓis monotone such that

ℓ(𝒛) ℓ(𝒛 ) , 𝒛 𝒛 0.

Now, by setting 𝙯= 𝑪𝙪+ 𝒎and 𝙯 = 𝑪𝙪+ 𝒎 𝐶𝑖𝑖𝘶𝑖𝐞𝑖, we obtain

ℓ(𝑪𝙪+ 𝒎) ℓ(𝑪𝙪+ 𝒎 𝐶𝑖𝑖𝘶𝑖𝐞𝑖) , 𝐶𝑖𝑖𝘶𝑖𝐞𝑖 0

for every 𝑖= 1, , 𝑑.

For the mean-field family, 𝑪𝙪+ 𝒎 𝐶𝑖𝑖𝘶𝑖𝐞𝑖is now independent of 𝘶𝑖. Thus,

𝔼𝐶𝑖𝑖𝘶𝑖D𝑖ℓ(𝑪𝙪+ 𝒎) 𝔼𝐶𝑖𝑖𝘶𝑖D𝑖ℓ(𝑪𝙪+ 𝒎 𝐞𝑖𝐶𝑖𝑖𝘶𝑖) = 𝐶𝑖𝑖(𝔼𝘶𝑖) (𝔼D𝑖𝑓(𝑪𝙪+ 𝒎 𝐞𝑖𝐶𝑖𝑖𝘶𝑖)) = 0,

where D𝑖𝑓denotes the 𝑖th axis of 𝑓. Since 𝐶𝑖𝑖> 0 by design,

𝔼𝐶𝑖𝑖𝘶𝑖D𝑖𝑓(𝑪𝙪+ 𝒎) > 0 𝔼𝘶𝑖D𝑖𝑓(𝑪𝙪+ 𝒎) > 0,

which is Assumption 4.

Proof of (ii) We provide an example that proves the statement. Let ℓ(𝒛) = 1

2𝒛 𝑨𝒛. Then,

𝑔(𝝀; 𝙪) = ℓ(𝒯𝝀(𝙪)) = 𝑨(𝑪𝙪+ 𝒎) = 𝑨𝑪𝙪+ 𝑨𝒎.

Suppose that we choose 𝝀such that

𝑪= [1 0 1 1]

and 𝒎= 𝟎. Also, setting

𝑨= [ 1 2 2 5 ] ,

we get a strongly convex function ℓ. Then,

𝑔(𝝀; 𝙪) = 𝑨𝑪𝙪= [ 1 2 2 5 ] [1 0 1 1] [𝘶1 𝘶2] = [ 1 2 3 5 ] [𝘶1 𝘶2] = [ 𝘶1 2𝘶2 3𝘶1 + 5𝘶2 ]

Finally, we have

𝔼𝑔1 (𝝀; 𝙪) 𝘶1 = 𝔼( 𝘶1 2𝘶2) 𝘶1 = 1 < 0,

which violates Assumption 4.

Lemma 11. For any function 𝑓 C1(ℝ, ℝ+) , there is no constant 0 < 𝐿< such that

|𝑓(𝑥) 𝑓(𝑦)| 𝐿|𝑥 𝑦|.

Proof. Suppose for the sake of contradiction that such 𝐿> 0 exists. Letting 𝑦 𝑥gives |𝑓 (𝑥)| 𝐿 for all 𝑥 ℝ. For each 𝑥, either 𝑓 (𝑥) 𝐿or 𝑓 (𝑥) 𝐿holds. We discuss two cases based on the value of 𝑓 (0).

If 𝑓 (0) 𝐿, we claim that 𝑓 (𝑥) 𝐿for all 𝑥 ℝ. Otherwise, 𝑓 (𝑥) < 𝐿for some 𝑥implies 𝑓 (𝑥) 𝐿. By the intermediate value theorem (𝑓 is continuous), there exists a point 𝑦between 0 and 𝑥that attains the value 𝑓 (𝑦) = 0, which is a contradiction.

Now that 𝑓 (𝑥) 𝐿> 0 for all 𝑥, 𝑓is an increasing function. For any 𝑥< 0, we have

𝑓(𝑥) = 𝑓(𝑥) 𝑓(0) + 𝑓(0) = |𝑓(𝑥) 𝑓(0)| + 𝑓(0) 𝐿|𝑥| + 𝑓(0).

Here, we can plug 𝑥 = 𝑓(0)

𝑓(𝑥 ) = 𝐿||| 𝑓(0)

𝐿 ||| + 𝑓(0) = |𝑓(0)| + 𝑓(0) 0,

which implies that 𝑓(𝑥 ) ℝ+, which is a contradiction.

Now we discuss the second case 𝑓 (0) 𝐿. By a similar argument, 𝑓 (𝑥) 𝐿for all 𝑥 ℝ. Thus, 𝑓is a decreasing function. For any 𝑥> 0, we have

𝑓(𝑥) = 𝑓(𝑥) 𝑓(0) + 𝑓(0) = |𝑓(𝑥) 𝑓(0)| + 𝑓(0) 𝐿𝑥+ 𝑓(0).

Picking 𝑥 = 𝑓(0)

𝐿 results in 𝑓(𝑥 ) ℝ+, which is a contradiction.

Theorem 2. Let ℓbe 𝜇-strongly convex. Then, we have the following:

(i) If 𝜙is linear, the energy 𝑓is 𝜇-strongly convex. (ii) If 𝜙is convex, the energy 𝑓is convex if and only if Assumption 4 holds. (iii) If 𝜙is such that 𝜙 C1 (ℝ, ℝ+), the energy 𝑓is not strongly convex.

Proof. The special case (i) is proven by Domke (2020, Theorem 9). We focus on the general statement (ii).

If ℓis 𝜇-strongly convex, the inequality

ℓ(𝒛) ℓ(𝒛 ) ℓ(𝒛 ) , 𝒛 𝒛 + 𝜇

2 𝒛 𝒛 2 2 (8)

holds, where the general convex case is obtained as a special case with 𝜇= 0. The goal is to relate this to the (𝜇-strong-)convexity of the energy with respect to the variational parameters given by

𝑓(𝝀) 𝑓(𝝀 ) 𝝀𝑓(𝝀 ) , 𝝀 𝝀 + 𝜇

Proof of (ii) Plugging the reparameterized latent variables to Equation (8) and taking the expectation, we have

𝔼ℓ(𝒯𝝀(𝙪)) 𝔼ℓ(𝒯𝝀 (𝙪)) 𝔼 ℓ(𝒯𝝀 (𝙪)) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) + 𝜇

2 𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2

𝑓(𝝀) 𝑓(𝝀 ) 𝔼 ℓ(𝒯𝝀 (𝙪)) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) + 𝜇

2 𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2

𝑓(𝝀) 𝑓(𝝀 ) 𝔼 𝑔(𝝀 ; 𝙪) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) + 𝜇

2 𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2 Thus, the energy is convex if and only if 𝔼 𝑔(𝝀; 𝙪) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 𝑓(𝝀) , 𝝀 𝝀 holds. This is established by Lemma 10.

Proof of (iii) We now prove that, under the nonlinear parameterization, the energy cannot be strongly convex. When the energy is convex, it is also strongly convex if and only if 𝜇 2 𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2 𝜇

From the proof of Domke (2020, Lemma 5), it follows that

𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2 = 𝑪 𝑪 2 F + 𝒎 𝒎 2 2. Furthermore, under nonlinear parameterizations,

𝑪 𝑪 2 F + 𝒎 𝒎 2 2

= (𝑫𝜙(𝒔) 𝑫𝜙(𝒔 )) (𝑳 𝑳 )

F + 𝒎 𝒎 2 2,

expanding the quadratic,

= 𝑫𝜙(𝒔) 𝑫𝜙(𝒔 )

F + 𝑳 𝑳 2 F 2 𝑫𝜙(𝒔) 𝑫𝜙(𝒔 ) , 𝑳 𝑳 F + 𝒎 𝒎 2 2,

and since 𝑫𝜙(𝒔) and 𝑳reside in different sub-spaces, they are orthogonal. Thus,

= 𝑫𝜙(𝒔) 𝑫𝜙(𝒔 )

F + 𝑳 𝑳 2 F + 𝒎 𝒎 2 2

= 𝜙(𝒔) 𝜙(𝒔 ) 2 2 + 𝑳 𝑳 2 F + 𝒎 𝒎 2 2. (9)

For the energy term to be strongly convex, Equation (9) must be bounded below by 𝝀 𝝀 2 2. Evidently, this implies that a necessary and sufficient condition is that ||𝜙(𝑠𝑖𝑖) 𝜙(𝑠 𝑖𝑖)|| 𝐿||𝑠𝑖𝑖 𝑠 𝑖𝑖|| by some constant 0 < 𝐿< . Notice that the direction of the inequality is reversed from the Lipschitz condition. Unfortunately, there is no such continuous and differentiable function 𝜙 ℝ ℝ+, as established by Lemma 11. Thus, for any diagonal conditioner 𝜙 C1 (ℝ, ℝ+), the energy cannot be strongly convex.

F.3 Convergence of Black-Box Variational Inference

F.3.1 Vanilla Black-Box Variational Inference

Theorem 5. Let the variational family satisfy Assumption 2, the likelihood satisfy Assumption 5, and the assumptions of Corollary 1 hold such that the ELBO, 𝐹, is 𝐿𝐹-smooth with 𝐿𝐹= 𝐿ℓ+ 𝐿𝜙+ 𝐿𝑠. Then, if the stepsize satisfy 𝛾< 1/𝐿𝐹, the iterates of BBVI with SGD and the 𝑀-sample reparameterization gradient estimator satisfy

min 0 𝑡 𝑇 1 𝔼 𝐹(𝝀𝑡) 2 2 𝛾2𝐿𝐹𝐿ℓ𝜅𝐶(𝑑, 𝜑)

𝑀 ( 𝒛joint 𝒛like 2 2 + 2 (𝐹 𝑓 L))

𝛾𝑇(1 + 𝛾2 4𝐿𝐹𝐿ℓ𝜅

𝑇 (𝐹(𝝀0) 𝐹 ) .

𝒛joint = projℓ(𝒛) is the projection of 𝒛onto set of minimizers of ℓ

𝒛like = projℓlike (𝒛) is the projection of 𝒛onto set of minimizers of ℓlike,

𝜅= 𝐿ℓ/𝜇 is the condition number, 𝐹 = inf 𝝀 Λ 𝐹(𝝀) ,

ℓ like = inf 𝝀 ℝ𝑑ℓlike (𝒛) ,

𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑 for the Cholesky nonlinear,

𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field nonlinear,

𝑀 is the number of Monte Carlo samples.

Proof. Khaled & Richtárik (2023, Theorem 2) show that, if the objective function 𝐹is 𝐿𝐹-smooth and the stochastic gradients satisfy the 𝐴𝐵𝐶given as

2 𝐴(𝐹(𝝀) 𝐹 ) + 𝐵 𝐹 2 2 + 𝐶

for some 0 < 𝐴, 𝐵, 𝐶< , SGD guarantees

min 0 𝑡 𝑇 1 𝔼 𝐹(𝝀𝑡) 2 2 𝐿𝐹𝐶𝛾+ 2(1 + 𝐿𝐹𝛾2𝐴) 𝑇

𝛾𝑇 (𝐹(𝝀0) 𝐹 ) .

Under the conditions of Corollary 1, 𝐹is 𝐿𝐹-smooth with 𝐿𝐹= 𝐿ℓ+ 𝐿𝑠+ 𝐿𝜙. Furthermore, under Assumption 5, Kim et al. (2023) show that the Monte Carlo gradient estimates satisfy

2 4𝐿2 ℓ𝐶(𝑑, 𝜑)

𝜇𝑀 (𝐹(𝝀) 𝐹 ) + 𝐵 𝐹 2 2

+ 2𝐿2 ℓ𝐶(𝑑, 𝜑)

𝜇𝑀 𝒛joint 𝒛like 2 2 + 4𝐿2 ℓ𝐶(𝑑, 𝜑)

𝜇𝑀 (𝐹 ℓ like) ,

This means that the 𝐴𝐵𝐶condition is satisfied with constants

𝐴= 4𝐿2 ℓ 𝜇𝑀𝐶(𝑑, 𝜑) , 𝐵= 1, 𝐶= 2𝐿2 ℓ 𝜇𝑀𝐶(𝑑, 𝜑) 𝒛joint 𝒛like 2 2 + 4𝐿2 ℓ 𝜇𝑀𝐶(𝑑, 𝜑) (𝐹 ℓ like) .

Plugging these constants in, we obtain

min 0 𝑡 𝑇 1 𝔼 𝑓(𝝀𝑡) 2 2 𝛾2𝐿𝐹𝐿2 ℓ𝐶(𝑑, 𝜑) 𝜇𝑀 ( 𝒛joint 𝒛like 2 2 + 2 (𝐹 ℓ like))

𝛾𝑇(1 + 𝛾2𝐿𝐹 4𝐿2 ℓ 𝜇𝑀𝐶(𝑑, 𝜑))

𝑇 (𝐹(𝝀0) 𝐹 ) .

Substituting the condition number yields the stated result.

Theorem 3. Let Assumption 2 hold, the likelihood satisfy Assumption 5, and the assumptions of Corollary 1 hold such that the ELBO 𝐹is 𝐿𝐹-smooth with 𝐿𝐹= 𝐿ℓ+ 𝐿𝜙+ 𝐿𝑠. Then, the iterates generated by BBVI through Equation (1) and the 𝑀-sample reparameterization gradient include an 𝜖-stationary point such that min0 𝑡 𝑇 1 𝔼 𝐹(𝝀𝑡) 2 𝜖for any 𝜖> 0 if

𝑇 𝒪( (𝐹(𝝀0) 𝐹 ) 2𝐿𝐹𝐿2 ℓ𝐶(𝑑, 𝑘𝜑) 𝜇𝑀𝜖4 )

for some fixed stepsize 𝛾, where 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family and 𝐶(𝑑, 𝜑) =

2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. As a corollary to Theorem 5, Khaled & Richtárik (2023, Corollary 1) show that, for an 𝐿𝐹smooth objective function 𝐹, a gradient estimator satisfying the ABC condition, an 𝜖-stationary point can be encountered if

𝐿𝐹𝐴𝑇 , 1 𝐿𝐹𝐵, 𝜖 2𝐿𝐹𝐶) , 𝑇 12 (𝐹(𝝀0) 𝐹 ) 𝐿𝐹

𝜖2 max (𝐵, 12 (𝐹(𝝀0) 𝐹 ) 𝐴

Under Assumption 5, Kim et al. (2023) show that the Monte Carlo gradient estimates satisfy

2 4𝐿2 ℓ𝐶(𝑑, 𝜑)

𝜇𝑀 (𝐹(𝝀) 𝐹 ) + 𝐵 𝐹 2 2

+ 2𝐿2 ℓ𝐶(𝑑, 𝜑)

𝜇𝑀 𝒛joint 𝒛like 2 2 + 4𝐿2 ℓ𝐶(𝑑, 𝜑)

𝜇𝑀 (𝐹 ℓ like) ,

This means that the 𝐴𝐵𝐶condition is satisfied with constants

𝐴= 4𝐿2 𝑓 𝜇𝑀𝐶(𝑑, 𝜑) , 𝐵= 1, 𝐶= 2𝐿2 𝑓 𝜇𝑀𝐶(𝑑, 𝜑) ( 𝒛joint 𝒛like 2 2 + 2 (𝐹 𝑓 L)) .

𝒛joint = projℓ(𝒛) is the projection of 𝒛onto set of minimizers of ℓ

𝒛like = projℓlike (𝒛) is the projection of 𝒛onto set of minimizers of ℓlike,

𝐹 = inf 𝝀 Λ 𝐹(𝝀) ,

ℓ like = inf 𝝀 ℝ𝑑ℓlike (𝒛) ,

𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑 for the Cholesky family,

𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family,

𝑀 is the number of Monte Carlo samples.

Plugging these constants in, we obtain

𝑇 12 (𝐹(𝝀0) 𝐹 ) 𝐿𝐹

𝜖2 max (1, 48 (𝐹(𝝀0) 𝐹 ) 𝐿2 ℓ𝐶(𝑑, 𝜑) 𝜇𝑀𝜖2 , 8𝐿2 ℓ𝐶(𝑑, 𝜑) ( 𝒛joint 𝒛like 2 2 + (𝐹 ℓ like))

= 𝒪((𝐹(𝝀0) 𝐹 ) 2𝐿𝐹𝐿2 ℓ𝐶(𝑑) 𝜇𝑀𝜖4 ) ,

where we omitted the dependence on 𝑘𝜑and the minimizers of ℓand ℓlike.

F.3.2 Proximal Black-Box Variational Inference

Lemma 3 (Convex Expected Smoothness). Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex with the variational family satisfying Assumption 2 with the linear parameterization. Then,

𝔼 𝝀𝑓(𝝀; 𝙪) 𝝀 𝑓(𝝀 ; 𝙪) 2 2 2𝐿ℓ𝜅𝐶(𝑑, 𝜑) B𝑓(𝝀, 𝝀 )

holds, where B𝑓(𝝀, 𝝀 ) 𝑓(𝝀) 𝑓(𝝀 ) 𝑓(𝝀 ) , 𝝀 𝝀 is the Bregman divergence, 𝜅= 𝐿ℓ/𝜇

is the condition number, 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family, and 𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. First, we have

𝔼 𝝀𝑓(𝝀; 𝙪) 𝝀 𝑓(𝝀 ; 𝙪) 2 2 = 𝔼 𝝀ℓ(𝒯𝝀(𝙪)) 𝝀 ℓ(𝒯𝝀 (𝙪)) 2 2

𝝀 𝑔(𝝀, 𝙪) 𝒯𝝀 (𝙪)

For the linear parameterization, the Jacobian of 𝒯𝝀does not depend on 𝝀. Therefore,

𝝀 (𝑔(𝝀, 𝙪) 𝑔(𝝀 , 𝙪))

2 and Lemma 6 yields

= 𝐽𝒯(𝙪) 𝔼 𝑔(𝝀, 𝙪) 𝑔(𝝀 , 𝙪) 2 2,

𝐽𝒯(𝒖) = 1 + 𝒖 2 2 for the Cholesky family and

𝐽𝒯(𝒖) = 1 + 𝑼2 F for the mean-field family.

From now on, we apply the strategy of Domke (2019, Theorem 3) for resolving the randomness 𝙪. That is,

𝔼𝐽𝒯(𝙪) 𝑔(𝝀, 𝙪) 𝑔(𝝀 , 𝙪) 2 2 = 𝐽𝒯(𝙪) ℓ(𝒯𝝀(𝙪)) ℓ(𝒯𝝀 (𝙪)) 2 2 from the 𝐿ℓ-smoothness of 𝑓,

𝐿2 ℓ𝔼𝐽𝒯(𝙪) 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2, and applying Corollary 2,

𝐿2 ℓ𝐶(𝑑, 𝜑) 𝝀 𝝀 2 2

The last step follows the approach of Kim et al. (2023), where we convert the quadratic bound into a bound involving the energy. Recall that the 𝜇-strongly convexity of ℓimplies 𝜇 2 𝒛 𝒛 2 2 ℓ(𝒛) ℓ(𝒛 ) ℓ(𝒛 ) , 𝒛 𝒛 . (10)

From Lemma 8, we have

𝐿2 ℓ𝐶(𝑑, 𝜑) 𝝀 𝝀 2 2 = 𝐿2 𝑓𝐶(𝑑, 𝜑) 𝔼 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) 2 2,

and by 𝜇-strongly convexity,

2𝐿2 ℓ 𝜇𝐶(𝑑, 𝜑) 𝔼(ℓ(𝒯𝝀(𝙪)) ℓ(𝒯𝝀 (𝙪)) ℓ(𝒯𝝀 (𝙪)) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) )

= 2𝐿2 ℓ 𝜇𝐶(𝑑, 𝜑) 𝔼(𝑓(𝝀; 𝙪) 𝑓(𝝀 ; 𝙪) 𝑔(𝝀 ; 𝙪) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) )

= 2𝐿2 ℓ 𝜇𝐶(𝑑, 𝜑) (𝑓(𝝀) 𝑓(𝝀 ) 𝔼 𝑔(𝝀 ; 𝙪) , 𝒯𝝀(𝙪) 𝒯𝝀 (𝙪) ).

Finally, by applying the equality in Lemma 10,

= 2𝐿2 ℓ 𝜇𝐶(𝑑, 𝜑) (𝑓(𝝀) 𝑓(𝝀 ) 𝑓(𝝀 ) , 𝝀 𝝀 ).

Lemma 12 (Variance Transfer). Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex with the variational family satisfying Assumption 2 with the linear parameterization. Also, let ˆ 𝑓be an 𝑀-sample gradient estimator of the energy. Then,

tr 𝕍ˆ 𝑓(𝝀) 4𝐿ℓ𝜅𝐶(𝑑, 𝜑)

𝑀 B𝑓(𝝀, 𝝀 ) + 2 tr 𝕍ˆ 𝑓(𝝀 ) ,

𝜅= 𝐿ℓ/𝜇is the condition number, B𝑓is the Bregman divergence defined in Lemma 3, 𝐶(𝑑, 𝜑) =

𝑑+ 𝑘𝜑for the Cholesky family, and 𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. First, the 𝑀-sample gradient estimator is defined as

𝑚=1 𝝀𝑓(𝝀; 𝙪𝑚) ,

where 𝙪𝑚 𝜑. Since 𝙪1, , 𝙪𝑚are independent and identically distributed, we have

tr 𝕍ˆ 𝑓(𝝀) = 1

𝑀tr 𝕍 𝝀𝑓(𝝀; 𝙪) .

From here, given Lemma 3, the proof is identical with that of Garrigos & Gower (2023, Lemma 8.20), except for the constants.

Theorem 6. Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex. Then, BBVI with proximal SGD in Equation (2), 𝑀-Monte Carlo samples, a variational family satisfying Assumption 2, the linear parameterization, and a fixed stepsize 0 < 𝛾 𝑀

2𝐿ℓ𝜅𝐶(𝑑,𝜑), the iterates satisfy

𝔼 𝝀𝑇 𝝀 2 2 (1 𝛾𝜇) 𝑇 𝝀0 𝝀2 2 2 + 2𝛾𝜎2

where 𝜅= 𝐿ℓ/𝜇is the condition number, 𝜎2 is defined in Lemma 4, 𝝀 = arg min𝝀 Λ 𝐹(𝝀),

𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family, and 𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. Provided that

(A.6.1) the energy 𝑓is 𝜇-strongly convex, (A.6.2) the energy 𝑓is 𝐿ℓ-smooth, (A.6.3) the regularizer ℎis convex, (A.6.4) the regularizer ℎis lower semi-continuous, (A.6.5) the convex expected smoothness condition holds, (A.6.6) the variance transfer condition holds, and (A.6.7) the gradient variance 𝜎2 at the optimum is finite such that 𝜎2 < ,

the proof is identical to that of Garrigos & Gower (2023, Theorem 11.9), which is based on the results of Gorbunov et al. (2020, Corollary A.2).

In our setting,

(A.6.1) is established by Theorem 2, (A.6.2) is established by Theorem 1, (A.6.3) is trivially satisfied since ℎis the negative entropy, (A.6.4) is trivially satisfied since ℎis continuous, (A.6.5) is established in Lemma 3, (A.6.6) is established in Lemma 12, (A.6.7) is established in Lemma 4.

The only difference is that, we replace the constant 𝐿max in the proof of Garrigos & Gower to 𝐿ℓ𝜅𝐶(𝑑, 𝜑)/𝑀. This stems from the different constants in the variance transfer condition.

Theorem 7. Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex. Then, for any 𝜖> 0, BBVI with proximal SGD in Equation (2), 𝑀-Monte Carlo samples, a variational family satisfying Assumption 2, and the linear parameterization guarantees 𝔼 𝝀𝑇 𝝀 2 2 𝜖if

2 𝜇 2𝜎2 , 𝑀 2𝐿ℓ𝜅𝐶(𝑑, 𝜑)) , 𝑇 max (1

𝜇2 , 2𝜅2 𝐶(𝑑, 𝜑)

𝑀 ) log (2 𝝀0 𝝀

where 𝜅= 𝐿ℓ/𝜇, 𝜎2 is defined in Lemma 4, 𝝀 = arg min𝝀 Λ 𝐹(𝝀), 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the

Cholesky family, and 𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. This is a corollary of the fixed stepsize convergence guarantee in Theorem 6 as shown by Garrigos & Gower (2023, Corollary 11.10). They guarantee an 𝜖-accurate solution as long as

2 2 2𝜎 F , 1 2𝐿max ) , 𝑇 max (1

𝜖 4𝜎 F 𝜇2 , 2𝐿max

𝜇 ) log (2 𝝀0 𝝀

In our notation, 𝜎 F = 𝜎2 and 𝐿max = 𝐿ℓ𝜅𝐶(𝑑, 𝜑)/𝑀.

Theorem 8. Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex. Then, BBVI with proximal SGD in Equation (2), the 𝑀-sample reparameterization gradient estimator, a variational family satisfying Assumption 2, the linear parameterization, 𝑇 4𝑇𝜅, and a stepsize schedule of

2𝐿ℓ𝜅𝐶(𝑑,𝜑) for 𝑡 4𝑇𝜅 2𝑡+1

(𝑡+1)2𝜇 for 𝑡> 4𝑇𝜅,

where 𝑇𝜅= 𝜅2𝐶(𝑑, 𝜑) 𝑀 1 , 𝜅= 𝐿ℓ/𝜇is the condition number, 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the

Cholesky family, and 𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family, then the iterates satisfy

𝔼 𝝀𝑇 𝝀 2 2 16 𝑇2 𝜅 𝝀0 𝝀 2 2 e2𝑇2 + 8𝜎2

where 𝜎2 is defined in Lemma 4, e is Euler s constant, and 𝝀 = arg min𝝀 Λ 𝐹(𝝀).

Proof. Under our assumptions, Theorem 6 holds, of which the proof is essentially obtaining the recursion

𝔼 𝝀𝑡+1 𝝀 2 2 = (1 𝛾𝑡𝜇) 𝔼 𝝀𝑡 𝝀 2 2 + 2𝛾2 𝑡𝜎2.

Instead of a fixed stepsize, we can apply the decreasing stepsize rule in the proof statement, then which the proof becomes identical to that of Gower et al. (2019, Theorem 3.2). We only need to replace ℒwith 𝐿max in the proof of Garrigos & Gower (2023, Theorem 11.9). This, in our notation, is 𝐿max = 𝐿ℓ𝜅𝐶(𝑑, 𝜑)/𝑀.

Theorem 4. Let ℓbe 𝐿ℓ-smooth and 𝜇-strongly convex. Then, for any 𝜖> 0, BBVI with proximal SGD in Equation (2), the 𝑀-sample reparameterization gradient estimator, a variational family satisfying Assumption 2 with the linear parameterization guarantees 𝔼 𝝀𝑇 𝝀 2 2 𝜖if

2𝐿ℓ𝜅𝐶(𝑑,𝜑) for 𝑡 4𝑇𝜅 2𝑡+1

(𝑡+1)2𝜇 for 𝑡> 4𝑇𝜅, 𝑇 max ( 8𝜎2

𝜇2 𝜖+ 4𝑇𝜅 𝝀0 𝝀 2

where 𝜎2 is defined in Lemma 4, 𝑇𝜅= 𝜅2𝐶(𝑑, 𝜑) 𝑀 1 , 𝜅= 𝐿ℓ/𝜇is the condition number, e is Euler s constant, 𝝀 = arg min𝝀 Λ 𝐹(𝝀), 𝐶(𝑑, 𝜑) = 𝑑+ 𝑘𝜑for the Cholesky family, and

𝐶(𝑑, 𝜑) = 2𝑘𝜑 𝑑+ 1 for the mean-field family.

Proof. The computational complexity follows from the smallest number of iterations 𝑇such that

𝔼 𝝀𝑇 𝝀 2 2 16𝑇2 𝜅 𝝀0 𝝀 2 2 e2𝑇2 + 8𝜎2

By multiplying both sides with 𝑇2 as

𝜇2 𝑇 16𝑇2 𝜅 𝝀0 𝝀 2 2 e2 0, (11)

we can see that we are looking for the smallest positive integer that is larger than the solution of a quadratic equation with respect to 𝑇. This is given as

2 + 64𝜖 𝑇2𝜅 𝝀0 𝝀 2 2 e2

Applying the inequality 𝑎+ 𝑏 𝑎+ 𝑏,

2 + 64𝜖 𝑇2𝜅 𝝀0 𝝀 2 2 e2

𝜇2 ) + 64𝜖 𝑇2𝜅 𝝀0 𝝀 2 2 e2 2𝜖

𝜇2 + 𝜖 8𝑇𝜅 𝝀0 𝝀 2

𝜇2 𝜖+ 4𝑇𝜅 𝝀0 𝝀 2

Thus, 𝔼 𝝀𝑇 𝝀 2 2 𝜖can be satisfied with a number of iterations at least

𝑇 max ( 8𝜎2

𝜇2 𝜖+ 4𝑇𝜅 𝝀0 𝝀 2

e 𝜖 , 4𝑇𝜅) .

G Details of Experimental Setup

Table 2: Summary of Datasets and Problems

Abbrev. Model Dataset 𝑑 𝑁

LME-election Linear Mixed Effects 1988 U.S. presidential election (Gelman & Hill, 2007) 90 11,566 LME-radon U.S. household radon levels (Gelman & Hill, 2007) 391 12,573

BT-tennis Bradley-Terry ATP World Tour tennis 6030 172,199

Linear Regression

KEGG-undirected (Shannon et al., 2003) 31 63,608 LR-song million songs (Bertin-Mahieux et al., 2011) 94 515,345 LR-buzz buzz in social media (Kawala et al., 2013) 81 583,250 LR-electric household electric 15 2,049,280

AR-ecg Sparse Autoregression Long-term ST ECG (Jager et al., 2003) 63 20,642,000

Linear Regression (LR-*) We consider a basic Bayesian hierarchical linear regression model

𝜎𝛼 𝒩+ (0, 102) , 𝜎𝜷 𝒩+ (0, 102) , 𝜎 𝒩+ (0, 0.32)

𝜷 𝒩(0, 𝜎2 𝜷𝑰) , 𝛼 𝒩(0, 𝜎2 𝛼) ,

𝑦𝑖 𝒩(𝜷 𝒙𝑖+ 𝛼, 𝜎2) ,

where a weakly informative half-normal hyperprior 𝒩+, a normal distribution with the support restricted to ℝ+, is assigned on the hyperparameters. For the datasets, we consider large-scale regression problems obtained from the UCI repository (Dua & Graff, 2017), shown in Table 2. For all datasets, we standardize the regressors 𝒙𝑖and the outcomes 𝑦𝑖.

Radon Levels (MLE-radon) MLE-radon is a radon level regression problem by Gelman & Hill (2007). It fits a hierarchical mixed-effects model for estimating household radon levels across different counties while considering the floor elevation of each site. The model is described as

𝜎 𝒩+ (0, 12) , 𝜎𝛼 𝒩+ (0, 12) , 𝜇𝛼 𝒩(0, 102) , 𝝐 𝒩(0, 102𝐈)

𝛽1 𝒩(0, 102) , 𝛽2 𝒩(0, 102) 𝜶= 𝜇𝛼+ 𝜎𝛼𝝐

𝜇𝑖= 𝛼[county𝑖] + 𝛽1 log (uppm𝑖) + floor𝑖𝛽2 log radon𝑖 𝒩(𝜇𝑖, 𝜎2) ,

which uses variable slopes and intercepts with non-centered parameterization. The dataset was obtained from Posterior DB (Magnusson et al., 2022). Also, for the radon regression problem, the Minnesota subset is often used due to computational reasons. Here, we use the full national dataset.

Presidential Election (MLE-election) MLE-election is a model for studying the effects of sociological factors on the 1988 United States presidential election (Gelman & Hill, 2007). The model is described as

𝜎age 𝒩(0, 1002) , 𝜎edu 𝒩(0, 1002) , 𝜎age edu 𝒩(0, 1002)

𝜎state 𝒩(0, 1002) , 𝜎region 𝒩(0, 1002) ,

𝒃age 𝒩(0, 𝜎2 age𝐈) , 𝒃edu 𝒩(0, 𝜎2 edu𝐈) , 𝒃age edu 𝒩(0, 𝜎2 age edu𝐈) ,

𝒃state 𝒩(0, 𝜎2 state𝐈) , 𝒃region 𝒩(0, 𝜎2 region𝐈)

𝜷 𝒩(0, 1002𝐈) 𝑝𝑖= 𝛽1 + 𝛽2 black𝑖+ 𝛽3 female𝑖+ 𝛽4 𝑣prev,𝑖+ 𝛽5 female𝑖black𝑖 + 𝑏age[age𝑖] + 𝑏edu[edu𝑖] + 𝑏age edu[age𝑖edu𝑖] + 𝑏state[state𝑖] + 𝑏region[region𝑖]

𝑦𝑖 bernoulli (𝑝𝑖) .

The dataset was obtained from Posterior DB (Magnusson et al., 2022).

Bradley-Terry (BT-Tennis) BT-Tennis is a Bradley-Terry model for estimating the skill of professional tennis players used by Giordano et al. (2023). The model is described as

𝜎 𝒩+ (0, 1)

𝜽 𝒩(𝟎, 𝜎2𝐈) 𝑝𝑖 𝜃[win𝑖] 𝜃[los𝑖] 𝑦𝑖 bernoulli (𝑝) ,

where win𝑖, los𝑖are the indices of the winning and losing players for the 𝑖th game, respectively. While we subsample over the games 𝑖= 1, , 𝑁, each player s involvement is sparse in that each player plays only a handful of games. Consequently, the subsampling noise is substantial. Therefore, we use a larger batch size of 500. Similarly to Giordano et al. (2023), we use the ATP World Tour data publically available online 1.

Autoregression (AR-ecg) AR-ecg is a linear autoregressive model. Here, we use a Student-t likelihood as originally proposed by Christmas & Everson (2011). While they originally imposed an automatic relevance detection prior on the autoregressive coefficients, we instead set a horseshoe shrinkage prior (Carvalho et al., 2009, 2010). Since the horseshoe is known to result in complex posterior geometry, this should make the problem more challenging. The model is described as

𝛼𝑑= 10 2, 𝛽𝑑= 10 2, 𝛼𝑑= 10 2, 𝛽𝑑= 10 2, 𝑑 gamma (𝛼𝑑, 𝛽𝑑) ,

𝜎 1 inverse-gamma (𝛼𝜎, 𝛽𝜎) , 𝜏 cauchy+ (0, 1) ,

𝝀 cauchy+ (𝟎, 𝟏) ,

𝜽 𝒩(0, 𝜏diag (𝝀)) 𝑦[𝑛] stduent-t (𝑑, 𝜃1𝑦[𝑛 1] + 𝜃2𝑦[𝑛 2] + + 𝜃𝑃𝑦[𝑛 𝑃], 𝜎) ,

where 𝑑is the degrees-of-freedom for the Student-t likelihood, cauchy+ is a half-Cauchy prior.

For the dataset, we use the long-term electrocardiogram measurements of Jager et al. (2003) obtained from Physionet (Goldberger et al., 2000). The data instance we used has a duration of 23 hours sampled at 250 Hz with 12-bit resolution over a range of 10 millivolts. During the experiments, we observed that the hyperparameters suggested by Christmas & Everson are sensitive to the signal amplitude. Therefore, we scaled the signal amplitude to be 10.

1https://datahub.io/sports-data/atp-world-tour-tennis-data

H Additional Experimental Results

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

(a) LME-election

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

(b) LME-radon

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

(c) BT-tennis

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

(d) LR-keggu

Figure 5: BBVI convergence speed (ELBO v.s. Iteration) and robustness against stepsize (ELBO at 𝑇= 50, 000 v.s. Base stepsize). The error bands are the 80% quantiles estimated from 20 (10 for AR-eeg) independent replications. The initial point was 𝒎0 = 𝟎, 𝑪0 = 𝐈.

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

(a) LR-song

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

(b) LR-buzz

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

(c) LR-electric

Base Stepsize (α) 10 4 10 3 0.01 0.1

Iteration (T) 1 25k 50k

Figure 6: BBVI convergence speed (ELBO v.s. Iteration) and robustness against stepsize (ELBO at 𝑇= 50, 000 v.s. Base stepsize). The error bands are the 80% quantiles estimated from 20 (10 for AR-eeg) independent replications. The initial point was 𝒎0 = 𝟎, 𝑪0 = 𝐈.