On the Convergence of Black-Box Variational Inference

Kyurae Kim (University of Pennsylvania, kyrkim@seas.upenn.edu), Jisu Oh (North Carolina State University, joh26@ncsu.edu), Kaiwen Wu (University of Pennsylvania, kaiwenwu@seas.upenn.edu), Yi-An Ma (University of California, San Diego, yianma@ucsd.edu), Jacob R. Gardner (University of Pennsylvania, jacobrg@seas.upenn.edu)

We provide the first convergence guarantee for black-box variational inference (BBVI) with the reparameterization gradient. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Notably, our analysis reveals that certain algorithm design choices commonly employed in practice, such as nonlinear parameterizations of the scale matrix, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations and thus achieves the strongest known convergence guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.

1 Introduction

Despite the practical success of black-box variational inference (BBVI; Kucukelbir et al., 2017; Ranganath et al., 2014; Titsias & Lázaro-Gredilla, 2014), also known as stochastic gradient variational Bayes and Monte Carlo variational inference, whether it converges under appropriate assumptions on the target problem has been an open problem for a decade. While our understanding of BBVI has been advancing (Bhatia et al., 2022; Challis & Barber, 2013; Domke, 2019, 2020; Hoffman & Ma, 2020), a full convergence guarantee that extends to the practical implementations used in probabilistic programming languages (PPLs) such as Stan (Carpenter et al., 2017), Turing (Ge et al., 2018), TensorFlow Probability (Dillon et al., 2017), Pyro (Bingham et al., 2019), and PyMC (Patil et al., 2010) has yet to be demonstrated.

Due to this lack of understanding, a consensus on how BBVI algorithms should be implemented has yet to be reached. For example, when the variational family is chosen to be the location-scale family, the scale matrix can be parameterized linearly or nonlinearly, and both parameterizations are used by default in popular software packages (see Table 1 in Kim et al., 2023). Surprisingly, as we will show, seemingly innocuous design choices like these can substantially impact the convergence of BBVI. This is critical because BBVI has been shown to be less robust (e.g., sensitive to initial points, stepsizes, and so on) than competing inference methods such as Markov chain Monte Carlo (MCMC; see Dhaka et al., 2020; Domke, 2020; Welandawe et al., 2022; Yao et al., 2018). Instead, the evaluation of BBVI algorithms has relied on expensive empirical comparisons (Agrawal et al., 2020; Dhaka et al., 2021; Giordano et al., 2018; Yao et al., 2018).

To rigorously analyze the design of BBVI algorithms, we establish the first convergence guarantee for the implementations precisely as used in practice.
We provide results for BBVI with the reparameterization gradient (RP; Kingma & Welling, 2014; Titsias & Lázaro-Gredilla, 2014) and the location-scale variational family, arguably the most widely used combination in practice. Our results apply to log-smooth posteriors, which is a routine assumption for analyzing the convergence of stochastic optimization (Garrigos & Gower, 2023) and sampling algorithms (Dwivedi et al., 2019, Section 2.3). The key is to show that the evidence lower bound (ELBO; Jordan et al., 1999) satisfies the regularity conditions required by convergence proofs of stochastic gradient descent (SGD; Bottou, 1999; Nemirovski et al., 2009; Robbins & Monro, 1951), the workhorse underlying BBVI.

Our analysis reveals that the nonlinear scale-matrix parameterizations used in practice are suboptimal: they provably break strong convexity and sometimes even convexity. Even if the posterior is strongly log-concave, the ELBO is no longer strongly convex. This contrasts with linear parameterizations, which guarantee the ELBO to be strongly convex whenever the posterior is strongly log-concave (Domke, 2020). Under linear parameterizations, however, the ELBO is no longer smooth, making optimization challenging. Because of this, Domke (2020) proposed to use proximal SGD, which Agrawal & Domke (2021, Appendix A) report to perform better than vanilla SGD with nonlinear parameterizations. Indeed, we show that BBVI with proximal SGD achieves the fastest known convergence rates of SGD, unlike vanilla BBVI. Thus, we provide a concrete reason for employing proximal SGD. We evaluate this insight on large-scale Bayesian inference problems by implementing an Adam-like (Kingma & Ba, 2015) variant of proximal SGD proposed by Yun et al. (2021).

Concurrently with this work, convergence guarantees for BBVI with the RP and the sticking-the-landing estimator (STL; Roeder et al., 2017) under the linear parameterization were published by Domke et al. (2023). To achieve this, they show that a quadratic bound on the gradient variance is sufficient to guarantee the convergence of projected and proximal SGD. In contrast, we focus on analyzing the ELBO under nonlinear parameterizations and connect it to existing analysis strategies. A more in-depth comparison of the two works is provided in Appendix E.

Convergence Guarantee for BBVI: Theorem 3 establishes a convergence guarantee for BBVI under assumptions matching the implementations used in practice, that is, without algorithmic simplifications or unrealistic assumptions such as a bounded domain or bounded support.

Optimality of Linear Parameterizations: Theorem 2 shows that, for location-scale variational families, nonlinear scale parameterizations prevent the ELBO from being strongly convex even when the target posterior is strongly log-concave.

Convergence Guarantee for Proximal BBVI: Theorem 4 guarantees that, if proximal SGD is used, BBVI on $\mu$-strongly log-concave posteriors can obtain a solution $\epsilon$-close to the global optimum within $\mathcal{O}(1/\epsilon)$ iterations.

Evaluation of Proximal BBVI in Practice: In Section 5, we evaluate the utility of proximal SGD on large-scale Bayesian inference problems.

2 Background

Notation. Random variables are denoted in serif (e.g., $\mathsf{x}$, $\boldsymbol{\mathsf{x}}$), vectors are in bold (e.g., $\boldsymbol{x}$, $\boldsymbol{\mathsf{x}}$), and matrices are in bold capitals (e.g., $\boldsymbol{A}$). For a vector $\boldsymbol{x} \in \mathbb{R}^d$, we denote the inner product by $\boldsymbol{x}^{\top}\boldsymbol{x}'$ or $\langle \boldsymbol{x}, \boldsymbol{x}' \rangle$, and the $\ell_2$-norm by $\lVert \boldsymbol{x} \rVert_2 = \sqrt{\boldsymbol{x}^{\top}\boldsymbol{x}}$. For a matrix $\boldsymbol{A}$, $\lVert \boldsymbol{A} \rVert_{\mathrm{F}} = \sqrt{\operatorname{tr}(\boldsymbol{A}^{\top}\boldsymbol{A})}$ denotes the Frobenius norm.
๐•Š๐‘‘ ++ is the set of positive definite matrices. For some function ๐‘“, D๐‘–๐‘“denotes the ๐‘–th coordinate of ๐‘“, and C๐‘˜(๐’ณ, ๐’ด) is the set of ๐‘˜-time differentiable continuous functions mapping from ๐’ณto ๐’ด. 2.1 Black-Box Variational Inference Variational inference (VI, Blei et al., 2017 Jordan et al., 1999 Zhang et al., 2019) aims to minimize the exclusive (or backward/reverse) Kullback-Leibler (KL) divergence as: minimize ๐€ ฮ› DKL (๐‘ž๐€, ๐œ‹) ๐”ผ๐™ฏ ๐‘ž๐€ log ๐œ‹(๐™ฏ) โ„(๐‘ž๐€) , where DKL (๐‘ž๐€, ๐œ‹) is the KL divergence, โ„ is the differential entropy, ๐œ‹ is the (target) posterior distribution, and ๐‘ž๐€ is the variational distribution, While alternative approaches to VI (Dieng et al., 2017 Hernandez-Lobato et al., 2016 Kim et al., 2022 Naesseth et al., 2020) exist, so far, exclusive KL minimization has been the most successful. We thus use exclusive KL minimization as a synonym for VI, following convention. Equivalently, one minimizes the negative evidence lower bound (ELBO, Jordan et al., 1999) ๐น: minimize ๐€ ฮ› ๐น(๐€) ๐”ผ๐™ฏ ๐‘ž๐€ log ๐‘(๐’›, ๐’™) โ„(๐‘ž๐€) , where log ๐‘(๐’›, ๐’™) is the joint likelihood, which is proportional to the posterior as ๐œ‹(๐’›) ๐‘(๐’›, ๐’™) = ๐‘(๐’™ ๐’›) ๐‘(๐’›), where ๐‘(๐’™ ๐’›) is the likelihood and ๐‘(๐’›) is the prior. 2.2 Variational Family In this work, we focus on the following variational family. ( d= is equivalence in distribution.) Definition 1 (Reparameterized Family). Let ๐œ‘be some ๐‘‘-variate distribution. Then, ๐‘ž๐€that can be equivalently represented as ๐™ฏ ๐‘ž๐€ ๐™ฏ d= ๐’ฏ๐€(๐™ช) ; ๐™ช ๐œ‘, is said to be part of a reparameterized family generated by the base distribution ๐œ‘and the reparameterization function ๐’ฏ๐€. Definition 2 (Location-Scale Reparameterization Function). ๐’ฏ๐€ โ„๐‘‘ โ„๐‘‘defined as ๐’ฏ๐€(๐’–) ๐‘ช๐’–+ ๐’Ž with ๐€containing the parameters for forming the location ๐’Ž โ„๐‘‘and scale ๐‘ช= ๐‘ช(๐€) โ„๐‘‘ ๐‘‘is called the location-scale reparameterization function. The location-scale family enables detailed theoretical analysis, as demonstrated by (Domke, 2019, 2020 Fujisawa & Sato, 2021 Kim et al., 2023), and includes the most widely used variational families such as the Student-t, elliptical, and Gaussian families (Titsias & Lรกzaro-Gredilla, 2014). Handling Constrained Support For common choices of the base distribution ๐œ‘, the support of ๐‘ž๐€is the whole โ„๐‘‘. Therefore, special treatment is needed when the support of ๐œ‹is constrained. Kucukelbir et al. (2017) proposed to handle this by applying diffeomorphic transformation denoted with ๐œ“, often called bjectors (Dillon et al., 2017 Fjelde et al., 2020 Leger, 2023), to ๐‘ž๐€such that ๐žฏ ๐‘ž๐œ“,๐€ ๐žฏ ๐‘‘= ๐œ“ 1(๐™ฏ); ๐™ฏ ๐‘ž๐€, such that the support of ๐‘ž๐œ“,๐€matches that of ๐œ‹. For example, when the support of ๐œ‹is โ„+, one can choose ๐œ“ 1 = exp. This approach, known as automatic differentiation VI (ADVI), is now standard in most modern PPLs. Why focus on posteriors with unconstrained supports? When bijectors are used, the entropy of ๐‘ž๐€, โ„(๐‘ž๐€), needs to be adjusted by the Jacobian of ๐œ“(Kucukelbir et al., 2017), ๐‘ฑ๐œ™ 1. However, applying the transformation to ๐œ‹instead of ๐‘ž๐€is mathematically equivalent and more convenient. 
Lastly, we impose light assumptions on the base distribution $\varphi$, which are already satisfied by most variational families used in practice. (i.i.d.: independently and identically distributed.)

Assumption 1 (Base Distribution). $\varphi$ is a $d$-variate distribution such that $\mathsf{u} \sim \varphi$ and $\mathsf{u} = (\mathsf{u}_1, \ldots, \mathsf{u}_d)$ has i.i.d. components. Furthermore, $\varphi$ is (i) symmetric and standardized such that $\mathbb{E}\mathsf{u}_i = 0$, $\mathbb{E}\mathsf{u}_i^2 = 1$, $\mathbb{E}\mathsf{u}_i^3 = 0$, and (ii) has finite kurtosis $\mathbb{E}\mathsf{u}_i^4 = k_\varphi < \infty$.

The assumptions on the variational family that we use throughout this work are collectively summarized as follows:

Assumption 2. The variational family is the location-scale family formed by Definitions 1 and 2 with the base distribution $\varphi$ satisfying Assumption 1.

2.3 Scale Parameterizations

For the scale matrix $\boldsymbol{C}(\boldsymbol{\lambda})$ in the location-scale family, any parameterization that results in a positive-definite covariance $\boldsymbol{C}\boldsymbol{C}^{\top} \in \mathbb{S}^d_{++}$ is valid. However, for the ELBO to ever be convex, the negative entropy $-\mathbb{H}(q_{\boldsymbol{\lambda}})$ must be convex, which restricts how the mapping $\boldsymbol{\lambda} \mapsto \boldsymbol{C}\boldsymbol{C}^{\top}$ can be parameterized. To ensure this, we restrict $\boldsymbol{C}$ to (lower) triangular matrices with strictly positive eigenvalues, that is, essentially Cholesky factors. This leaves two of the most common parameterizations:

Definition 3 (Mean-Field Family). $\boldsymbol{C} = \boldsymbol{D}_{\phi}(\boldsymbol{s})$, where the $d$ elements of $\boldsymbol{s}$ form the diagonal, and $\boldsymbol{\lambda} \in \Lambda$ with $\Lambda = \{(\boldsymbol{m}, \boldsymbol{s}) \mid \boldsymbol{m} \in \mathbb{R}^d,\ \boldsymbol{s} \in \mathcal{S}\}$.

Definition 4 (Full-Rank Cholesky Family). $\boldsymbol{C} = \boldsymbol{D}_{\phi}(\boldsymbol{s}) + \boldsymbol{L}$, where the $d$ elements of $\boldsymbol{s}$ form the diagonal, $\boldsymbol{L}$ is a $d$-by-$d$ strictly lower triangular matrix, and $\boldsymbol{\lambda} \in \Lambda$ with $\Lambda = \{(\boldsymbol{m}, \boldsymbol{s}, \boldsymbol{L}) \mid \boldsymbol{m} \in \mathbb{R}^d,\ \boldsymbol{s} \in \mathcal{S},\ \operatorname{vec}(\boldsymbol{L}) \in \mathbb{R}^{d(d-1)/2}\}$.

Here, $\mathcal{S}$ is discussed in the next paragraph, $\boldsymbol{D}_{\phi}(\boldsymbol{s}) \in \mathbb{R}^{d \times d}$ is a diagonal matrix such that $\boldsymbol{D}_{\phi}(\boldsymbol{s}) \triangleq \operatorname{diag}(\phi(\boldsymbol{s})) = \operatorname{diag}(\phi(s_1), \ldots, \phi(s_d))$, and $\phi$ is a function we call a diagonal conditioner.

Linear vs. Nonlinear Parameterizations. When the diagonal conditioner is linear, $\phi(x) = x$, we say that the covariance parameterization is linear. In this case, to ensure that $\boldsymbol{C}$ is a Cholesky factor, the domain of $\boldsymbol{s}$ is set as $\mathcal{S} = \mathbb{R}^d_+$. On the other hand, by choosing a nonlinear conditioner $\phi: \mathbb{R} \to \mathbb{R}_+$, we can make the domain of $\boldsymbol{s}$ the unconstrained $\mathcal{S} = \mathbb{R}^d$. Because of this, nonlinear conditioners such as the softplus, $\phi(x) \triangleq \log(1 + \exp(x))$ (Dugas et al., 2000), are frequently used in practice, especially for the mean-field family (see Table 1 by Kim et al., 2023).

2.4 Problem Structure of Black-Box Variational Inference

Exclusive KL minimization VI is fundamentally a composite (regularized) optimization problem,

$$F(\boldsymbol{\lambda}) = f(\boldsymbol{\lambda}) + h(\boldsymbol{\lambda}), \qquad \text{(ELBO)}$$

where $f(\boldsymbol{\lambda}) \triangleq \mathbb{E}_{\mathsf{z} \sim q_{\boldsymbol{\lambda}}} \ell(\mathsf{z})$ is the energy term, $\ell(\boldsymbol{z}) \triangleq -\log p(\boldsymbol{z}, \boldsymbol{x})$ is the negative joint log-likelihood, and $h(\boldsymbol{\lambda}) \triangleq -\mathbb{H}(q_{\psi,\boldsymbol{\lambda}})$ is the entropic regularizer.
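To make the composite structure concrete, the following is a minimal sketch (ours, not the authors' code) of the negative ELBO for a mean-field Gaussian location-scale family: the energy $f$ is estimated by reparameterized Monte Carlo, while the entropic regularizer $h$ is available in closed form. The quadratic neg_log_joint is a hypothetical stand-in for $\ell(\boldsymbol{z}) = -\log p(\boldsymbol{z}, \boldsymbol{x})$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2

def neg_log_joint(z):
    """Stand-in for ell(z) = -log p(z, x); here a simple quadratic."""
    return 0.5 * np.sum(z**2)

def neg_elbo(m, s, n_samples=64, conditioner=np.exp):
    """Monte Carlo estimate of F(lambda) = f(lambda) + h(lambda) for a mean-field
    Gaussian with location m and scale C = diag(conditioner(s))."""
    scale = conditioner(s)                    # diagonal of C via the diagonal conditioner
    u = rng.standard_normal((n_samples, d))   # u ~ varphi = N(0, I)
    z = m + scale * u                         # T_lambda(u) = C u + m
    energy = np.mean([neg_log_joint(zi) for zi in z])                           # f(lambda)
    neg_entropy = -np.sum(np.log(scale)) - 0.5 * d * np.log(2 * np.pi * np.e)   # h(lambda)
    return energy + neg_entropy

print(neg_elbo(np.zeros(d), np.zeros(d)))
```

Swapping the conditioner argument between np.exp, a softplus, or the identity reproduces the parameterization choices discussed in Section 2.3.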
Figure 1: Taxonomy of variational inference. (Legend: CP = composite, IS = infinite sum, RP = reparameterized, FS = finite sum, ERM = empirical risk minimization.) Within BBVI, this work only considers the reparameterization gradient ($\mathrm{BBVI} \cap \mathrm{RP}$, shown in dark red). This leaves out BBVI with the score gradient ($\mathrm{BBVI} \setminus \mathrm{RP}$, shown in light red). The set $\mathrm{VI} \cap \mathrm{FS}$ includes sparse variational Gaussian processes (Titsias, 2009), while the remaining set $\mathrm{VI} \setminus (\mathrm{FS} \cup \mathrm{IS} \cup \mathrm{RP})$ includes coordinate ascent VI (Blei et al., 2017).

From here, BBVI introduces more structure. An illustration of the taxonomy is shown in Figure 1. In particular, BBVI has an infinite-sum structure (IS). That is, it cannot be represented as a sum of finitely many subcomponents as in ERM. Furthermore,

$$F(\boldsymbol{\lambda}) = \mathbb{E}_{\mathsf{u} \sim \varphi}\, f(\boldsymbol{\lambda}; \mathsf{u}) + h(\boldsymbol{\lambda}) \qquad \text{(CP IS)}$$
$$\phantom{F(\boldsymbol{\lambda})} = \mathbb{E}_{\mathsf{u} \sim \varphi}\, \ell(\mathcal{T}_{\boldsymbol{\lambda}}(\mathsf{u})) + h(\boldsymbol{\lambda}), \qquad \text{(CP IS RP)}$$

where $f(\boldsymbol{\lambda}; \boldsymbol{u}) \triangleq \ell(\mathcal{T}_{\boldsymbol{\lambda}}(\boldsymbol{u}))$.

Theoretical Challenges. The structure of BBVI poses multiple challenges that have hindered its theoretical analysis: (i) the stochasticity of the Jacobian of $\mathcal{T}$ and (ii) the infinite-sum structure. For Item (i), we can see that in

$$\nabla_{\boldsymbol{\lambda}} \ell(\mathcal{T}_{\boldsymbol{\lambda}}(\boldsymbol{u})) = \left(\frac{\partial \mathcal{T}_{\boldsymbol{\lambda}}(\boldsymbol{u})}{\partial \boldsymbol{\lambda}}\right)^{\!\top} \nabla \ell(\mathcal{T}_{\boldsymbol{\lambda}}(\boldsymbol{u})) = \left(\frac{\partial \mathcal{T}_{\boldsymbol{\lambda}}(\boldsymbol{u})}{\partial \boldsymbol{\lambda}}\right)^{\!\top} g(\boldsymbol{\lambda}; \boldsymbol{u}),$$

where $g(\boldsymbol{\lambda}; \boldsymbol{u}) \triangleq (\nabla \ell \circ \mathcal{T}_{\boldsymbol{\lambda}})(\boldsymbol{u})$, both the Jacobian of $\mathcal{T}_{\boldsymbol{\lambda}}$ and the gradient of the log-likelihood, $g$, depend on the randomness $\boldsymbol{u}$. Effectively decoupling the two is a major challenge to analyzing the properties of the ELBO and its gradient estimators (Domke, 2019, 2020). For Item (ii), the problem is that recent analyses of SGD (Garrigos & Gower, 2023; Gower et al., 2019; Nguyen et al., 2018; Vaswani et al., 2019) increasingly rely on the assumption that $f(\boldsymbol{\lambda}; \boldsymbol{u})$ is smooth for all $\boldsymbol{u}$, that is,

$$\lVert \nabla_{\boldsymbol{\lambda}} f(\boldsymbol{\lambda}; \boldsymbol{u}) - \nabla_{\boldsymbol{\lambda}} f(\boldsymbol{\lambda}'; \boldsymbol{u}) \rVert \le L \lVert \boldsymbol{\lambda} - \boldsymbol{\lambda}' \rVert$$

for some $L < \infty$. This is sensible if the support of $\mathsf{u}$ is bounded, which is true in the ERM setting but not for the class of infinite-sum (IS) problems. Previous works circumvented this issue by assuming (i) that the support of $\mathsf{u}$ is bounded (Fujisawa & Sato, 2021), which implicitly changes the variational family, or (ii) that the gradient $\nabla f$ is bounded by a constant (Buchholz et al., 2018; Liu & Owen, 2021), which contradicts strong convexity (Nguyen et al., 2018).

3 The Evidence Lower Bound Under Nonlinear Scale Parameterizations

Under the linear parameterization ($\phi(x) = x$), the properties of the ELBO, such as smoothness and convexity, have been previously analyzed by Challis & Barber (2013); Domke (2020); Titsias & Lázaro-Gredilla (2014). We generalize these results to nonlinear conditioners.

3.1 Technical Assumptions

Let $g_i(\boldsymbol{\lambda}; \mathsf{u})$ be the $i$th coordinate of $g(\boldsymbol{\lambda}; \mathsf{u})$ and recall that $\mathsf{u}_i$ denotes the $i$th element of $\mathsf{u}$. Establishing convexity and smoothness of the ELBO under nonlinear parameterizations depends on a pair of necessary and sufficient assumptions. To establish smoothness:

Assumption 3. The gradient of $\ell$ under reparameterization, $g$, satisfies $\lvert \mathbb{E}\, g_i(\boldsymbol{\lambda}; \mathsf{u})\, \mathsf{u}_i\, \phi''(s_i) \rvert \le L_s$ for every coordinate $i = 1, \ldots, d$, any $\boldsymbol{\lambda} \in \Lambda$, and some $0 < L_s < \infty$. Here, $\phi''$ is the second derivative of $\phi$.

The next one is required to establish convexity:

Assumption 4. The gradient of $\ell$ under reparameterization, $g$, satisfies $\mathbb{E}\, g_i(\boldsymbol{\lambda}; \mathsf{u})\, \mathsf{u}_i \ge 0$ for every coordinate $i = 1, \ldots, d$.

Intuitively, these assumptions control how much $\ell$ and $\mathcal{T}_{\boldsymbol{\lambda}}$ rotate the randomness $\mathsf{u}$. (Notice that the assumptions are closely related to the matrix $\operatorname{Cov}(g(\boldsymbol{\lambda}; \mathsf{u}), \mathsf{u})$, the covariance between $g$ and $\mathsf{u}$.)
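As a rough illustration (our own sketch with a hypothetical quadratic $\ell$, not taken from the paper), the expectations appearing in Assumptions 3 and 4 can be checked numerically by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 3, 200_000
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.5, 0.2],
              [0.0, 0.2, 1.0]])            # hypothetical ell(z) = z^T A z / 2

softplus = lambda x: np.log1p(np.exp(x))
d2_softplus = lambda x: np.exp(x) / (1.0 + np.exp(x))**2   # phi''(x) for the softplus

m, s = np.zeros(d), np.full(d, -1.0)       # mean-field variational parameters
u = rng.standard_normal((n_samples, d))    # u ~ varphi = N(0, I)
z = m + softplus(s) * u                    # T_lambda(u) for the mean-field family
g = z @ A                                  # g(lambda; u) = grad ell(z) = A z

corr = (g * u).mean(axis=0)
# Assumption 4: E[g_i(lambda; u) u_i] should be nonnegative; here it equals A_ii * phi(s_i) > 0,
# consistent with Proposition 1 (i) since ell is convex and the family is mean-field.
print("E[g_i u_i]              ", corr)
# Assumption 3: |E[g_i(lambda; u) u_i] phi''(s_i)| is the quantity bounded by L_s.
print("|E[g_i u_i] phi''(s_i)| ", np.abs(corr * d2_softplus(s)))
```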
However, the peculiar aspect of these assumptions is that they are not implied by the convexity and smoothness of $\ell$. In particular, Assumption 3 depends strongly on the internals of $\ell$.

3.2 Smoothness of the Entropy

Under the linear parameterization, Domke (2020) has previously shown that the entropic regularizer $h$ is not smooth. This fact immediately implies that the ELBO is not smooth. However, certain nonlinear conditioners do result in a smooth regularizer.

Lemma 1. If the diagonal conditioner $\phi$ is $L_h$-log-smooth, then the entropic regularizer $h(\boldsymbol{\lambda})$ is $L_h$-smooth.

Proof. See the full proof on page 24.

Example 1. The following diagonal conditioners result in a smooth entropic regularizer:
1. Let $\phi(x) = \mathrm{softplus}(x)$. Then $h$ is $L_h$-smooth with $L_h \approx 0.167096$.
2. Let $\phi(x) = \exp(x)$. Then $h$ is $L_h$-smooth for arbitrarily small $L_h$.

This might initially suggest that nonlinear diagonal conditioners are a promising way of making the ELBO globally smooth. Unfortunately, the properties of the energy, $f$, change unfavorably.

3.3 Smoothness of the Energy

Inapplicability of the Existing Proof Strategy. Previously, Domke (2020, Theorem 1) proved that the energy is smooth when $\phi$ is linear. The key step was to use Bessel's inequality, based on the observation that the partial derivatives of the reparameterization function $\mathcal{T}$ form unit bases in expectation. That is,

$$\mathbb{E} \left\langle \frac{\partial \mathcal{T}_{\boldsymbol{\lambda}}(\mathsf{u})}{\partial \lambda_i}, \frac{\partial \mathcal{T}_{\boldsymbol{\lambda}}(\mathsf{u})}{\partial \lambda_j} \right\rangle = \mathbb{1}_{i=j},$$

where $\mathbb{1}_{i=j}$ is an indicator function that is 1 only when $i = j$ and 0 otherwise. Unfortunately, when $\phi$ is nonlinear, the partial derivatives $\partial \mathcal{T}_{\boldsymbol{\lambda}}(\mathsf{u}) / \partial \lambda_i$ for $i = 1, \ldots, p$ no longer form unit bases: while they are still orthogonal in expectation, their lengths change nonlinearly depending on $\boldsymbol{\lambda}$. This leaves Bessel's inequality inapplicable. To circumvent this challenge, we establish a replacement for Bessel's inequality:

Lemma 2. Let $\mathsf{H}$ be an $n \times n$ symmetric random matrix bounded as $\lVert \mathsf{H} \rVert_2 \le L < \infty$ almost surely, and let $\mathsf{J}$ be an $n \times m$ random matrix such that $\mathbb{E} \lVert \mathsf{J}^{\top} \mathsf{J} \rVert_2 < \infty$. Then, $\mathbb{E} \lVert \mathsf{J}^{\top} \mathsf{H} \mathsf{J} \rVert_2 \le L\, \mathbb{E} \lVert \mathsf{J}^{\top} \mathsf{J} \rVert_2$.

Proof. See the full proof on page 24.

Remark 1. By assuming that the joint log-likelihood $\ell$ is smooth and twice differentiable, we retrieve Theorem 1 of Domke (2020) by setting $\mathsf{J}$ to be the Jacobian of $\mathcal{T}$ and $\mathsf{H}$ to be the Hessian of $\ell$ under reparameterization.

Remark 2. While our reparameterization function's partial derivatives still form orthogonal bases, they need not be unit bases: unlike Bessel's inequality, Lemma 2 does not require this. In this sense, Lemma 2 is a more general strategy than Bessel's inequality.

Equipped with Lemma 2, we present our main result on smoothness:

Theorem 1. Let $\ell$ be $L_\ell$-smooth and twice differentiable. Then, the following results hold: (i) If $\phi$ is linear, the energy $f$ is $L_\ell$-smooth. (ii) If $\phi$ is 1-Lipschitz, the energy $f$ is $(L_\ell + L_s)$-smooth if and only if Assumption 3 holds.

Proof. See the full proof on page 27.

Combined with Lemma 1, this directly implies that the overall ELBO is smooth.

Corollary 1 (Smoothness of the ELBO). Let $\ell$ be $L_\ell$-smooth and let Assumption 3 hold. Furthermore, let the diagonal conditioner be 1-Lipschitz continuous and $L_\phi$-log-smooth. Then, the ELBO is $(L_\ell + L_s + L_\phi)$-smooth.
The increase of the smoothness constant implies that we need to use a smaller stepsize to guarantee convergence when using a nonlinear $\phi$. Furthermore, even for simple $L$-smooth examples, Assumption 3 may fail to hold:

Example 2. Let $\ell(\boldsymbol{z}) = (1/2)\, \boldsymbol{z}^{\top} \boldsymbol{A} \boldsymbol{z}$ and the diagonal conditioner be $\phi(x) = \mathrm{softplus}(x)$. Then, (i) if $\boldsymbol{A}$ is dense and the variational family is the mean-field family, or (ii) if $\boldsymbol{A}$ is diagonal and the variational family is the Cholesky family, Assumption 3 holds with $L_s \approx 0.26034\, (\max_{i=1,\ldots,d} A_{ii})$. (iii) If $\boldsymbol{A}$ is dense but the Cholesky family is used, Assumption 3 does not hold.

Proof. See the full proof on page 29.

Figure 2: Optimization landscape resulting from different $\phi$ on a strongly convex $\ell$, where $\ell$ is the counterexample of Proposition 1, Item (ii). (Curves shown: the ELBO with $\phi(x) = \mathrm{softplus}(x)$, the ELBO with $\phi(x) = x$, a lower-bounding quadratic, and a tangent line at $s_1 = 5$.) $\phi(x) = x$ preserves strong convexity, as shown by the lower-bounding quadratic (red dotted line). $\phi = \mathrm{softplus}$ violates the first-order condition of convexity (black dotted line).

Example 2 illustrates that establishing the smoothness of the energy becomes non-trivial under nonlinear parameterizations. Even when smoothness does hold, the increased smoothness constant implies that BBVI will be less robust to initialization and stepsizes. Furthermore, in the next section, we show a much graver problem: nonlinear parameterizations may affect the convergence rate.

3.4 Convexity of the Energy

The convexity of the ELBO under linear parameterizations was first established by Titsias & Lázaro-Gredilla (2014, Proposition 1) and Domke (2020, Theorem 9). In particular, Domke (2020) shows that, when $\phi$ is linear, if $\ell$ is $\mu$-strongly convex, the energy is also $\mu$-strongly convex. However, when using a nonlinear $\phi$ with co-domain $\mathbb{R}_+$, which is the whole point of using a nonlinear conditioner, the strong convexity of $\ell$ never transfers to $f$.

Theorem 2. Let $\ell$ be $\mu$-strongly convex. Then, we have the following: (i) If $\phi$ is linear, the energy $f$ is $\mu$-strongly convex. (ii) If $\phi$ is convex, the energy $f$ is convex if and only if Assumption 4 holds. (iii) If $\phi$ is such that $\phi \in C^1(\mathbb{R}, \mathbb{R}_+)$, the energy $f$ is not strongly convex.

Proof. See the full proof on page 33.

The following proposition provides conditions under which Assumption 4 does or does not hold.

Proposition 1. We have the following: (i) If $\ell$ is convex, then for the mean-field family, Assumption 4 holds. (ii) For the Cholesky family, there exists a convex $\ell$ for which Assumption 4 does not hold.

Proof. See the full proof on page 31.

Thus, for any continuous, differentiable nonlinear conditioner that maps only to non-negative reals, the strong convexity of $\ell$ does not lead to a strongly convex ELBO. This phenomenon is visualized in Figure 2: the loss surface becomes flat near the optimal scale parameter, and the problem becomes more noticeable as the optimal scale becomes smaller.

Nonlinear conditioners are suboptimal. As the dataset grows, Bayesian posteriors are known to contract, as characterized by the Bernstein-von Mises theorem (van der Vaart, 1998). That is, the posterior variance becomes close to 0. This behavior also applies to misspecified variational posteriors, as shown by Wang & Blei (2019). Thus, for large datasets, nonlinear conditioners mostly operate in the regime where they are suboptimal (locally less strongly convex).
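To see the flattening numerically, here is a small sketch of ours (a one-dimensional instance in the spirit of Example 2, with $\ell(z) = z^2/2$ and a mean-field Gaussian) comparing the curvature of the energy along the scale coordinate under the linear and softplus conditioners. In this case $f(m, s) = \tfrac{1}{2}\big(m^2 + \phi(s)^2\big)$, so the curvature in $s$ is $\phi'(s)^2 + \phi(s)\,\phi''(s)$: it stays at 1 for $\phi(x) = x$ but vanishes as $s \to -\infty$ for the softplus.

```python
import numpy as np

softplus = lambda x: np.log1p(np.exp(x))
sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))

def scale_curvature(s, conditioner="softplus"):
    """Second derivative in s of the energy phi(s)^2 / 2 for ell(z) = z^2 / 2 and u ~ N(0, 1):
    curvature = phi'(s)^2 + phi(s) * phi''(s)."""
    if conditioner == "linear":          # phi(x) = x
        return np.ones_like(s)
    p, p1 = softplus(s), sigmoid(s)      # phi and phi'
    p2 = p1 * (1.0 - p1)                 # phi'' for the softplus
    return p1**2 + p * p2

s = np.array([-10.0, -5.0, 0.0, 5.0])
print("linear  :", scale_curvature(s, "linear"))    # constant curvature: strong convexity preserved
print("softplus:", scale_curvature(s, "softplus"))  # curvature -> 0 as s -> -inf: no strong convexity
```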
But linear conditioners result in a non-smooth entropy (Domke, 2020). This dilemma is what originally motivated Domke (2020) to consider proximal SGD, which we analyze in Section 4.2.

4 Convergence Analysis of Black-Box Variational Inference

4.1 Black-Box Variational Inference

BBVI with SGD repeats the steps

$$\boldsymbol{\lambda}_{t+1} = \boldsymbol{\lambda}_t - \gamma_t \big( \widehat{\nabla f}(\boldsymbol{\lambda}_t) + \nabla h(\boldsymbol{\lambda}_t) \big), \quad \text{where} \quad \widehat{\nabla f}(\boldsymbol{\lambda}_t) = \frac{1}{M} \sum_{m=1}^{M} \nabla_{\boldsymbol{\lambda}} \ell(\mathcal{T}_{\boldsymbol{\lambda}_t}(\mathsf{u}_m)) \qquad (1)$$

with $\mathsf{u}_m \sim \varphi$ is the $M$-sample reparameterization gradient estimator and $\gamma_t$ is the stepsize (see Kucukelbir et al., 2017, for algorithmic details). With our results in Section 3 and the results of Khaled & Richtárik (2023) and Kim et al. (2023), we obtain a convergence guarantee. To apply the result of Kim et al. (2023), which bounds the gradient variance, we require an additional assumption.

Assumption 5. The negative log-likelihood $\ell_{\mathrm{like}}(\boldsymbol{z}) \triangleq -\log p(\boldsymbol{x} \mid \boldsymbol{z})$ is $\mu$-quadratically growing, that is, for all $\boldsymbol{z} \in \mathbb{R}^d$,

$$\frac{\mu}{2} \lVert \boldsymbol{z} - \boldsymbol{z}^*_{\mathrm{like}} \rVert_2^2 \le \ell_{\mathrm{like}}(\boldsymbol{z}) - \ell^*_{\mathrm{like}},$$

where $\boldsymbol{z}^*_{\mathrm{like}}$ is the projection of $\boldsymbol{z}$ onto the set of minimizers of $\ell_{\mathrm{like}}$ and $\ell^*_{\mathrm{like}} = \inf_{\boldsymbol{z} \in \mathbb{R}^d} \ell_{\mathrm{like}}(\boldsymbol{z})$.

This assumption is weaker than assuming that the likelihood satisfies the Polyak-Łojasiewicz inequality (Karimi et al., 2016).

Theorem 3. Let Assumption 2 hold, the likelihood satisfy Assumption 5, and the assumptions of Corollary 1 hold such that the ELBO $F$ is $L_F$-smooth with $L_F = L_\ell + L_\phi + L_s$. Then, the iterates generated by BBVI through Equation (1) with the $M$-sample reparameterization gradient include an $\epsilon$-stationary point, that is, $\min_{0 \le t \le T-1} \mathbb{E} \lVert \nabla F(\boldsymbol{\lambda}_t) \rVert^2 \le \epsilon$ for any $\epsilon > 0$, if

$$T \ge \mathcal{O}\!\left( \frac{ (F(\boldsymbol{\lambda}_0) - F^*)^2\, L_F\, L_\ell^2\, C(d, \varphi) }{ \mu M \epsilon^4 } \right)$$

for some fixed stepsize $\gamma$, where $C(d, \varphi) = d + k_\varphi$ for the Cholesky family and $C(d, \varphi) = 2\sqrt{k_\varphi d} + 1$ for the mean-field family.

Proof. See the full proof on page 35.

Remark 3. Finding an $\epsilon$-stationary point of the ELBO has an iteration complexity of $\mathcal{O}(d L_\ell^2 \kappa M^{-1} \epsilon^{-4})$ for the Cholesky family and $\mathcal{O}(\sqrt{d} L_\ell^2 \kappa M^{-1} \epsilon^{-4})$ for the mean-field family.

4.2 Black-Box Variational Inference with Proximal SGD

Proximal SGD. For a composite objective $F = f + h$, proximal SGD repeats the steps

$$\boldsymbol{\lambda}_{t+1} = \operatorname{prox}_{\gamma_t, h}\!\big( \boldsymbol{\lambda}_t - \gamma_t \widehat{\nabla f}(\boldsymbol{\lambda}_t) \big) = \operatorname*{arg\,min}_{\boldsymbol{\lambda} \in \Lambda} \left[ \big\langle \widehat{\nabla f}(\boldsymbol{\lambda}_t), \boldsymbol{\lambda} \big\rangle + h(\boldsymbol{\lambda}) + \frac{1}{2\gamma_t} \lVert \boldsymbol{\lambda} - \boldsymbol{\lambda}_t \rVert_2^2 \right], \qquad (2)$$

where $\operatorname{prox}$ is known as the proximal operator and $\gamma_1, \ldots, \gamma_T$ is a stepsize schedule. In the context of VI, proximal SGD has previously been considered by Altosaar et al. (2018); Diao et al. (2023); Khan et al. (2016, 2015). Their overall focus has been on developing alternative algorithms by generalizing the metric $\lVert \boldsymbol{\lambda} - \boldsymbol{\lambda}' \rVert$ to other metrics. In contrast, Domke (2020) considered proximal SGD with the regular Euclidean metric $\lVert \boldsymbol{\lambda} - \boldsymbol{\lambda}' \rVert_2$ for overcoming the non-smoothness of $h$ under linear parameterizations. Here, we prove the convergence of this scheme and show that it retrieves the fastest known convergence rates in stochastic first-order optimization.

Proximal Operator for BBVI. In our context, $h$ is the negative entropy of $q_{\boldsymbol{\lambda}}$ in the location-scale family. For this, Domke (2020) shows that the proximal update for $s_1, \ldots, s_d$ is

$$\operatorname{prox}_{\gamma_t, h}(s_i) = s_i + \frac{1}{2}\left( \sqrt{s_i^2 + 4\gamma_t} - s_i \right).$$

For the other parameters, the proximal operator reduces to the regular gradient descent update in Equation (1).
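The update is cheap to implement. Below is a minimal sketch of ours (mean-field Gaussian, linear parameterization; the quadratic target is a placeholder, and nothing here is the authors' code) of one proximal SGD step: a plain stochastic gradient step on the energy, followed by the closed-form proximal map of the entropy applied to the diagonal scale.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M = 2, 16
A = np.eye(d)   # placeholder target: ell(z) = z^T A z / 2, so grad ell(z) = A z

def energy_grad(m, s):
    """M-sample reparameterization gradient of the energy f for the mean-field
    family with the linear conditioner phi(x) = x (so C = diag(s))."""
    u = rng.standard_normal((M, d))
    z = m + s * u                                  # T_lambda(u)
    g = z @ A                                      # g(lambda; u) = grad ell(z)
    return g.mean(axis=0), (g * u).mean(axis=0)    # df/dm and df/ds estimates

def prox_entropy(s, gamma):
    """Closed-form proximal operator of h restricted to the diagonal scale:
    prox_{gamma, h}(s_i) = (s_i + sqrt(s_i^2 + 4 gamma)) / 2."""
    return 0.5 * (s + np.sqrt(s**2 + 4.0 * gamma))

def proximal_sgd_step(m, s, gamma):
    gm, gs = energy_grad(m, s)
    m_new = m - gamma * gm                         # plain SGD step on the location
    s_new = prox_entropy(s - gamma * gs, gamma)    # SGD step on the scale + entropy prox
    return m_new, s_new

m, s = np.ones(d), np.ones(d)
for t in range(200):
    m, s = proximal_sgd_step(m, s, gamma=0.05)
print(m, s)   # for this target the optimum is m* = 0, s* = 1, since C* C*^T = A^{-1} = I
```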
Gradient Variance Bound. We first establish a bound on the gradient variance. In ERM, contemporary strategies do this by exploiting the finite-sum structure of the objective (Section 2.4). Here, we establish a variance bound for the RP estimator that does not rely on the finite-sum assumption.

Lemma 3 (Convex Expected Smoothness). Let $\ell$ be $L_\ell$-smooth and $\mu$-strongly convex, with the variational family satisfying Assumption 2 under the linear parameterization. Then,

$$\mathbb{E} \lVert \nabla_{\boldsymbol{\lambda}} f(\boldsymbol{\lambda}; \mathsf{u}) - \nabla_{\boldsymbol{\lambda}} f(\boldsymbol{\lambda}'; \mathsf{u}) \rVert_2^2 \le 2 L_\ell\, \kappa\, C(d, \varphi)\, \mathrm{B}_f(\boldsymbol{\lambda}, \boldsymbol{\lambda}')$$

holds, where $\mathrm{B}_f(\boldsymbol{\lambda}, \boldsymbol{\lambda}') \triangleq f(\boldsymbol{\lambda}) - f(\boldsymbol{\lambda}') - \langle \nabla f(\boldsymbol{\lambda}'), \boldsymbol{\lambda} - \boldsymbol{\lambda}' \rangle$ is the Bregman divergence, $\kappa = L_\ell / \mu$ is the condition number, $C(d, \varphi) = d + k_\varphi$ for the Cholesky family, and $C(d, \varphi) = 2\sqrt{k_\varphi d} + 1$ for the mean-field family.

Proof. See the full proof on page 36.

Furthermore, the gradient variance at the optimum must be bounded:

Lemma 4 (Domke, 2019; Kim et al., 2023). Let $\ell$ be $L_\ell$-smooth, with the variational family satisfying Assumption 2 and a 1-Lipschitz diagonal conditioner $\phi$. Then, the gradient variance at the optimum $\boldsymbol{\lambda}^* \in \operatorname*{arg\,min}_{\boldsymbol{\lambda} \in \Lambda} F(\boldsymbol{\lambda})$ is bounded as

$$\sigma^2 \triangleq \mathbb{E} \big\lVert \widehat{\nabla f}(\boldsymbol{\lambda}^*) \big\rVert_2^2 \le \frac{C(d, \varphi)\, L_\ell^2}{M} \left( \lVert \boldsymbol{z}^* - \boldsymbol{m}^* \rVert_2^2 + \lVert \boldsymbol{C}^* \rVert_{\mathrm{F}}^2 \right),$$

where $\boldsymbol{z}^*$ is a stationary point of $\ell$, $\boldsymbol{m}^*$ and $\boldsymbol{C}^*$ are the location and scale formed by $\boldsymbol{\lambda}^*$, the constants are $C(d, \varphi) = d + k_\varphi$ for the Cholesky family and $C(d, \varphi) = 2\sqrt{k_\varphi d} + 1$ for the mean-field family, and $k_\varphi$ is the kurtosis of $\varphi$ as defined in Assumption 1.

Proof. The full-rank case is proven by Domke (2019, Theorem 3), while the mean-field case is a basic corollary of the result by Kim et al. (2023, Lemma 2).

Remark 4. The dimensional dependence in the complexity of BBVI is inherited from the variance bound in Lemma 4. Unfortunately, for the Cholesky family, the dimensional dependence in this variance bound is tight (Domke, 2019).

Main Result. With the gradient variance bounds in hand, we now present our complexity result. The proof is identical to that of Theorem 3.2 by Gower et al. (2019), which uses a two-stage decreasing stepsize schedule: the stepsize is initially held constant and then decreased at a $1/t$ rate.

Theorem 4. Let $\ell$ be $L_\ell$-smooth and $\mu$-strongly convex. Then, for any $\epsilon > 0$, BBVI with proximal SGD in Equation (2), the $M$-sample reparameterization gradient estimator, and a variational family satisfying Assumption 2 with the linear parameterization guarantees $\mathbb{E} \lVert \boldsymbol{\lambda}_T - \boldsymbol{\lambda}^* \rVert_2^2 \le \epsilon$ if

$$\gamma_t = \begin{cases} \dfrac{1}{2 L_\ell\, \kappa\, C(d, \varphi)} & \text{for } t \le 4 T_\kappa, \\[1.5ex] \dfrac{2t + 1}{(t + 1)^2 \mu} & \text{for } t > 4 T_\kappa, \end{cases} \qquad \text{and} \qquad T \ge \max\!\left( \frac{8 \sigma^2}{\mu^2 \epsilon} + 4 T_\kappa,\; \frac{4 T_\kappa}{\mathrm{e}} \frac{\lVert \boldsymbol{\lambda}_0 - \boldsymbol{\lambda}^* \rVert_2}{\sqrt{\epsilon}} \right),$$

where $\sigma^2$ is defined in Lemma 4, $T_\kappa = \lceil \kappa^2 C(d, \varphi)\, M^{-1} \rceil$, $\kappa = L_\ell / \mu$ is the condition number, $\mathrm{e}$ is Euler's constant, $\boldsymbol{\lambda}^* = \operatorname*{arg\,min}_{\boldsymbol{\lambda} \in \Lambda} F(\boldsymbol{\lambda})$, $C(d, \varphi) = d + k_\varphi$ for the Cholesky family, and $C(d, \varphi) = 2\sqrt{k_\varphi d} + 1$ for the mean-field family.

Proof. See the full proof on page 38.

Remark 5. BBVI with proximal SGD on a $\mu$-strongly convex and $L_\ell$-smooth $\ell$ has a complexity of $\mathcal{O}(\kappa^2 d M^{-1} \epsilon^{-1})$ for the Cholesky family and $\mathcal{O}(\kappa^2 \sqrt{d} M^{-1} \epsilon^{-1})$ for the mean-field family.

Remark 6. We also provide a similar result with a fixed stepsize in Theorem 7 of Appendix F.3.2. In this case, the complexity is $\mathcal{O}(\kappa^2 d M^{-1} \epsilon^{-1} \log \epsilon^{-1})$ for the Cholesky family and $\mathcal{O}(\kappa^2 \sqrt{d} M^{-1} \epsilon^{-1} \log \epsilon^{-1})$ for the mean-field family.
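As a small illustration of the schedule (ours; the constants simply mirror the statement of Theorem 4, with assumed values for $L_\ell$, $\mu$, $d$, $k_\varphi$, and $M$), the two-stage stepsize is straightforward to implement:

```python
import math

def theorem4_stepsize(t, L_ell, mu, d, k_phi, M, family="cholesky"):
    """Two-stage schedule in the style of Gower et al. (2019): constant warm-up,
    then O(1/t) decay. All constants here are illustrative, not prescriptive."""
    C = d + k_phi if family == "cholesky" else 2.0 * math.sqrt(k_phi * d) + 1.0
    kappa = L_ell / mu
    T_kappa = math.ceil(kappa**2 * C / M)            # switching point
    if t <= 4 * T_kappa:
        return 1.0 / (2.0 * L_ell * kappa * C)       # constant phase
    return (2.0 * t + 1.0) / ((t + 1.0)**2 * mu)     # decreasing 1/t phase

# Example: Gaussian base distribution (kurtosis k_phi = 3), d = 10, M = 10 samples.
for t in [1, 1_000, 100_000]:
    print(t, theorem4_stepsize(t, L_ell=100.0, mu=10.0, d=10, k_phi=3.0, M=10))
```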
Figure 3: Stepsize ($\gamma$, horizontal axis) versus the number of iterations $T$ needed for $\epsilon$-accuracy (vertical axis) for vanilla SGD and proximal SGD to achieve $\mathrm{D}_{\mathrm{KL}}(q_{\boldsymbol{\lambda}}, \pi) \le \epsilon = 1$ under different initializations for Gaussian posteriors. (Curves shown: proximal SGD with $\phi(x) = x$, SGD with $\phi(x) = x$, and SGD with $\phi(x) = \mathrm{softplus}(x)$.) The initializations $C(\boldsymbol{\lambda}_0)$ are $\boldsymbol{I}$, $10^{-3}\boldsymbol{I}$, and $10^{-5}\boldsymbol{I}$ from left to right, respectively. The average suboptimality at iteration $t$ was estimated from 10 independent runs. For each run, the target posterior was a 10-dimensional Gaussian with a covariance whose condition number is $\kappa = 10$ and whose smoothness is $L = 100$.

Figure 4: Comparison of BBVI convergence speed (ELBO vs. iteration) of different optimization algorithms (ProxGen-Adam with $\phi(x) = x$, Adam with $\phi(x) = x$, and Adam with $\phi(x) = \mathrm{softplus}(x)$) on tasks including LME-election and LR-electric. The error bands are the 80% quantiles estimated from 20 (10 for AR-eeg) independent replications. The results shown used a base stepsize of $\gamma = 10^{-3}$, while the initial point was $\boldsymbol{m}_0 = \boldsymbol{0}$, $\boldsymbol{C}_0 = \boldsymbol{I}$. Details on the setup can be found in the text of Section 5.2 and Appendix G.

5 Experiments

5.1 Synthetic Problem

Setup. We first compare proximal SGD against vanilla SGD with linear and nonlinear parameterizations on a synthetic problem that is log-smooth and strongly log-concave and for which the exact solution is known. While a similar experiment was already conducted by Domke (2020), here we include nonlinear parameterizations, which were not originally considered. We run all algorithms with a fixed stepsize to infer a multivariate Gaussian with a full-rank covariance matrix. The variational approximation is a full-rank Gaussian formed by $\varphi = \mathcal{N}(0, 1)$ and the Cholesky parameterization.

Results. The results are shown in Figure 3. Proximal SGD is clearly the most robust against initialization. Also, SGD with the nonlinear parameterization $\phi(x) = \mathrm{softplus}(x)$ is much slower to converge under all initializations. This confirms that linear parameterizations are indeed superior in both robustness to initialization and convergence speed.

5.2 Realistic Problems

Setup. We now evaluate proximal SGD on realistic problems. In practice, Adam (Kingma & Ba, 2015) is observed to be robust against stepsize choices (Zhang et al., 2019). The reason why Adam performs well on non-smooth, non-convex problems is still under investigation (Kunstner et al., 2023; Reddi et al., 2023; Zhang et al., 2022). Nonetheless, to compare fairly against Adam, we implement a recently proposed variant of proximal SGD called ProxGen (Yun et al., 2021), which includes an Adam-like update rule. The probabilistic models and datasets are fully described in Appendix G. We implement these models and BBVI on top of the Turing (Ge et al., 2018) probabilistic programming framework. Due to the size of these datasets, we implement doubly stochastic subsampling (Titsias & Lázaro-Gredilla, 2014) with a batch size of $B = 100$ ($B = 500$ for BT-tennis) and $M = 10$ Monte Carlo samples. For batch subsampling, we implement random reshuffling, which is faster than independent subsampling both empirically (Bottou, 2009) and theoretically (Ahn et al., 2020; Haochen & Sra, 2019; Mishchenko et al., 2020; Nagaraj et al., 2019). We also observe that doubly stochastic BBVI benefits from reshuffling, but leave a detailed investigation to future work.
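As a rough sketch (ours, not the Turing-based implementation used in the experiments; the toy Gaussian model is hypothetical), doubly stochastic BBVI combines two sources of randomness per step, a reshuffled minibatch of data and $M$ reparameterization samples, and here it reuses the entropy proximal step from Section 4.2 to keep the scale positive.

```python
import numpy as np

rng = np.random.default_rng(3)
N, B, M, d = 1000, 100, 10, 2
data = rng.standard_normal((N, d)) + 2.0          # toy dataset

def grad_neg_log_joint(z, batch):
    """Minibatch estimate of grad ell(z) for a toy model with
    p(x_i | z) = N(x_i; z, I) and p(z) = N(z; 0, I)."""
    return N * (z - batch.mean(axis=0)) + z       # (N/B) * batch term + prior term

def epoch_batches(n, batch_size):
    """Random reshuffling: one permutation per epoch, then disjoint batches."""
    perm = rng.permutation(n)
    for i in range(0, n, batch_size):
        yield perm[i:i + batch_size]

m, s, gamma = np.zeros(d), np.ones(d), 1e-4
for epoch in range(20):
    for idx in epoch_batches(N, B):               # doubly stochastic: subsample the data...
        u = rng.standard_normal((M, d))           # ...and draw M reparameterization samples
        z = m + s * u                             # mean-field family, linear conditioner
        g = np.stack([grad_neg_log_joint(zi, data[idx]) for zi in z])
        m = m - gamma * g.mean(axis=0)                   # SGD step on the location
        v = s - gamma * (g * u).mean(axis=0)             # SGD step on the scale (energy only)...
        s = 0.5 * (v + np.sqrt(v**2 + 4.0 * gamma))      # ...followed by the entropy prox
print(m, data.mean(axis=0))   # the variational mean approaches the posterior mean (close to the data mean)
```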
Results. Representative results are shown in Figure 4, with additional results in Appendix H. Both ProxGen-Adam and Adam with linear parameterizations converge faster than Adam with the nonlinear parameterization. Furthermore, in the case of election and buzz, Adam with the nonlinear parameterization converges much more slowly than the alternatives. When using linear parameterizations, ProxGen-Adam appears to be generally faster than Adam. We note, however, that due to the difference in the update rules of ProxGen-Adam and Adam, proximal operators alone might not fully explain the performance difference. Nevertheless, the results of our experiments conclusively suggest that linear parameterizations are superior.

6 Discussions

Conclusions. In this work, we have proven the convergence of BBVI. Our assumptions encompass implementations that are actually used in practice, and our theoretical analysis revealed limitations in some popular design choices (mainly the use of nonlinear conditioners). To resolve this issue, we re-evaluated the utility of proximal SGD both theoretically and practically, where it achieves the strongest theoretical guarantees known in stochastic first-order optimization.

Related Works. To prove the convergence of BBVI, early works assumed a priori the regularity of the ELBO and the gradient estimator (Alquier & Ridgway, 2020; Buchholz et al., 2018; Khan et al., 2016, 2015; Liu & Owen, 2021; Regier et al., 2017). Toward a more rigorous understanding, Domke (2019); Fan et al. (2015); Kim et al. (2023); Xu et al. (2019) studied the reparameterization gradient, Xu & Campbell (2022) studied the asymptotics of the ELBO, Challis & Barber (2013); Domke (2020); Titsias & Lázaro-Gredilla (2014) established convexity, and Domke (2020) established smoothness. On the other hand, Bhatia et al. (2022); Hoffman & Ma (2020) established rigorous convergence guarantees by considering simplified variants of BBVI where only the scale is optimized, and Fujisawa & Sato (2021) assumed that the support of $\varphi$ is bounded almost surely. Meanwhile, under assumptions similar to ours, Diao et al. (2023); Lambert et al. (2022) recently established convergence guarantees for proximal SGD BBVI with a Bures-Wasserstein metric. Their computational properties differ from those of BBVI, as they require Hessian evaluations. Also, understanding BBVI, which is VI with a Euclidean metric, remains an important problem due to its practical relevance.

Limitations. Our work has multiple limitations: our results are restricted to (i) the location-scale family, (ii) the reparameterization gradient, and (iii) smooth joint log-likelihoods. However, the location-scale family with the reparameterization gradient is the most widely used combination in practice, and replacing the smoothness assumption is an active area of research in stochastic optimization. For our results on proximal SGD, we further assume that the joint log-likelihood is $\mu$-strongly convex (equivalently, that the posterior is strongly log-concave). It is unclear how to extend the guarantees to joint log-likelihoods that are only smooth but non-log-concave.

Open Problems. Although we have proven that the mean-field variational family has a dimensional dependence of $\mathcal{O}(\sqrt{d})$, empirical results suggest room for improvement (Kim et al., 2023). Therefore, we pose the following conjecture:

Conjecture 1. Under mild assumptions, BBVI for the mean-field variational family converges with only logarithmic dimensional dependence or no explicit dimensional dependence at all.
This would put mean-field BBVI in a regime clearly faster than approximate MCMC (Freund et al., 2022). Also, it is unknown whether the $\mathcal{O}(\kappa^2)$ condition-number dependence is tight. In fact, for proximal SGD BBVI in the Bures-Wasserstein space, Diao et al. (2023) report a dependence of $\mathcal{O}(\kappa)$. Lastly, it would be interesting to see whether natural gradient VI (NGVI; Amari, 1998; Khan & Lin, 2017) can achieve similar convergence guarantees. While it is empirically known that NGVI often converges faster (Lin et al., 2019), theoretical evidence has yet to follow.

Acknowledgments and Disclosure of Funding

The authors would like to thank Justin Domke for discussions on the concurrent results, Javier Burroni for pointing out a mistake in an earlier version of this work, and the anonymous reviewers for their constructive comments. K. Kim and J. R. Gardner were funded by the National Science Foundation Award [IIS-2145644], while Y.-A. Ma was funded by the National Science Foundation Grants [NSF-SCALE MoDL-2134209] and [NSF-CCF-2112665 (TILOS)], the U.S. Department of Energy, Office of Science, and the Facebook Research award.

References

Agrawal, Abhinav, & Domke, Justin. 2021. Amortized Variational Inference for Simple Hierarchical Models. Pages 21388–21399 of: Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc. (page 2) Agrawal, Abhinav, Sheldon, Daniel R, & Domke, Justin. 2020. Advances in Black-Box VI: Normalizing Flows, Importance Weighting, and Optimization. Pages 17358–17369 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 1) Ahn, Kwangjun, Yun, Chulhee, & Sra, Suvrit. 2020. SGD with Shuffling: Optimal Rates without Component Convexity and Large Epoch Requirements. Pages 17526–17535 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 10) Alquier, Pierre, & Ridgway, James. 2020. Concentration of Tempered Posteriors and of Their Variational Approximations. The Annals of Statistics, 48(3), 1475–1497. (page 10) Altosaar, Jaan, Ranganath, Rajesh, & Blei, David. 2018. Proximity Variational Inference. Pages 1961–1969 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 84. JMLR. (page 7) Amari, Shun-ichi. 1998. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251–276. (page 10) Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian, & Lamere, Paul. 2011. The Million Song Dataset. In: Proceedings of the International Conference on Music Information Retrieval. (page 40) Bhatia, Kush, Kuang, Nikki Lijing, Ma, Yi-An, & Wang, Yixin. 2022 (July). Statistical and Computational Trade-Offs in Variational Inference: A Case Study in Inferential Model Selection. arXiv Preprint arXiv:2207.11208. arXiv. (pages 1, 10) Bingham, Eli, Chen, Jonathan P., Jankowiak, Martin, Obermeyer, Fritz, Pradhan, Neeraj, Karaletsos, Theofanis, Singh, Rohit, Szerlip, Paul, Horsfall, Paul, & Goodman, Noah D. 2019. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research, 20(28), 1–6. (page 1) Blei, David M., Kucukelbir, Alp, & McAuliffe, Jon D. 2017. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518), 859–877. (pages 2, 4) Bottou, Léon. 1999. On-Line Learning and Stochastic Approximations. Pages 9–42 of: On-Line Learning in Neural Networks, first edn. Cambridge University Press. (page 2) Bottou, Léon. 2009.
Curiously Fast Convergence of Some Stochastic Gradient Descent Algorithms. (page 10) Buchholz, Alexander, Wenzel, Florian, & Mandt, Stephan. 2018. Quasi-Monte Carlo Variational Inference. Pages 668 677 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 80. JMLR. (pages 4, 10) Carpenter, Bob, Gelman, Andrew, Hoffman, Matthew D., Lee, Daniel, Goodrich, Ben, Betancourt, Michael, Brubaker, Marcus, Guo, Jiqiang, Li, Peter, & Riddell, Allen. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1). (page 1) Carvalho, Carlos M., Polson, Nicholas G., & Scott, James G. 2009. Handling Sparsity via the Horseshoe. Pages 73 80 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 5. JMLR. (page 41) Carvalho, Carlos M., Polson, Nicholas G., & Scott, James G. 2010. The Horseshoe Estimator for Sparse Signals. Biometrika, 97(2), 465 480. (page 41) Challis, Edward, & Barber, David. 2013. Gaussian Kullback-Leibler Approximate Inference. Journal of Machine Learning Research, 14(68), 2239 2286. (pages 1, 5, 10) Christmas, Jacqueline, & Everson, Richard. 2011. Robust Autoregression: Student-T Innovations Using Variational Bayes. IEEE Transactions on Signal Processing, 59(1), 48 57. (page 41) Dhaka, Akash Kumar, Catalina, Alejandro, Andersen, Michael R, ns Magnusson, Mรฅ, Huggins, Jonathan, & Vehtari, Aki. 2020. Robust, Accurate Stochastic Optimization for Variational Inference. Pages 10961 10973 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 1) Dhaka, Akash Kumar, Catalina, Alejandro, Welandawe, Manushi, Andersen, Michael R., Huggins, Jonathan, & Vehtari, Aki. 2021. Challenges and Opportunities in High Dimensional Variational Inference. Pages 7787 7798 of: Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc. (page 1) Diao, Michael Ziyang, Balasubramanian, Krishna, Chewi, Sinho, & Salim, Adil. 2023. Forward Backward Gaussian Variational Inference via JKO in the Bures-Wasserstein Space. Pages 7960 7991 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 202. JMLR. (pages 7, 10) Dieng, Adji Bousso, Tran, Dustin, Ranganath, Rajesh, Paisley, John, & Blei, David. 2017. Variational Inference via ๐œ’Upper Bound Minimization. Pages 2729 2738 of: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (page 2) Dillon, Joshua V., Langmore, Ian, Tran, Dustin, Brevdo, Eugene, Vasudevan, Srinivas, Moore, Dave, Patton, Brian, Alemi, Alex, Hoffman, Matt, & Saurous, Rif A. 2017 (Nov.). Tensor Flow Distributions. ar Xiv Preprint ar Xiv:1711.10604. ar Xiv. (pages 1, 3) Domke, Justin. 2019. Provable Gradient Variance Guarantees for Black-Box Variational Inference. Pages 329 338 of: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (pages 1, 3, 4, 8, 10, 21, 22, 36) Domke, Justin. 2020. Provable Smoothness Guarantees for Black-Box Variational Inference. Pages 2587 2596 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 119. JMLR. (pages 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 25, 27, 33) Domke, Justin, Garrigos, Guillaume, & Gower, Robert. 2023. Provable Convergence Guarantees for Black-Box Variational Inference. In: Advances in Neural Information Processing Systems (to Appear). New Orleans, LA, USA: ar Xiv. (pages 2, 17, 21) Dua, Dheeru, & Graff, Casey. 2017. UCI Machine Learning Repository. 
(page 40) Duchi, John, Hazan, Elad, & Singer, Yoram. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul), 2121 2159. (page 20) Dugas, Charles, Bengio, Yoshua, Bรฉlisle, Franรงois, Nadeau, Claude, & Garcia, Renรฉ. 2000. Incorporating Second-Order Functional Knowledge for Better Option Pricing. In: Advances in Neural Information Processing Systems, vol. 13. MIT Press. (page 4) Dwivedi, Raaz, Chen, Yuansi, Wainwright, Martin J., & Yu, Bin. 2019. Log-Concave Sampling: Metropolis-Hastings Algorithms Are Fast. Journal of Machine Learning Research, 20(183), 1 42. (page 2) Fan, Kai, Wang, Ziteng, Beck, Jeff, Kwok, James, & Heller, Katherine A. 2015. Fast Second Order Stochastic Backpropagation for Variational Inference. Pages 1387 1395 of: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (page 10) Fjelde, Tor Erlend, Xu, Kai, Tarek, Mohamed, Yalburgi, Sharan, & Ge, Hong. 2020. Bijectors.jl: Flexible Transformations for Probability Distributions. Pages 1 17 of: Proceedings of The Symposium on Advances in Approximate Bayesian Inference. PMLR, vol. 118. JMLR. (page 3) Freund, Yoav, Ma, Yi-An, & Zhang, Tong. 2022. When Is the Convergence Time of Langevin Algorithms Dimension Independent? A Composite Optimization Viewpoint. Journal of Machine Learning Research, 23(214), 1 32. (page 10) Fujisawa, Masahiro, & Sato, Issei. 2021. Multilevel Monte Carlo Variational Inference. Journal of Machine Learning Research, 22(278), 1 44. (pages 3, 4, 10) Garrigos, Guillaume, & Gower, Robert M. 2023 (Feb.). Handbook of Convergence Theorems for (Stochastic) Gradient Methods. ar Xiv Preprint ar Xiv:2301.11235. ar Xiv. (pages 2, 4, 21, 37, 38) Ge, Hong, Xu, Kai, & Ghahramani, Zoubin. 2018. Turing: A Language for Flexible Probabilistic Inference. Pages 1682 1690 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 84. JMLR. (pages 1, 10) Gelman, Andrew, & Hill, Jennifer. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge New York: Cambridge University Press. (page 40) Giordano, Ryan, Broderick, Tamara, & Jordan, Michael I. 2018. Covariances, Robustness, and Variational Bayes. Journal of Machine Learning Research, 19(51), 1 49. (page 1) Giordano, Ryan, Ingram, Martin, & Broderick, Tamara. 2023 (Apr.). Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box. ar Xiv Preprint ar Xiv:2304.05527. ar Xiv. (page 41) Goldberger, Ary L., Amaral, Luis A. N., Glass, Leon, Hausdorff, Jeffrey M., Ivanov, Plamen Ch., Mark, Roger G., Mietus, Joseph E., Moody, George B., Peng, Chung-Kang, & Stanley, H. Eugene. 2000. Physio Bank, Physio Toolkit, and Physio Net: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23). (page 41) Gorbunov, Eduard, Hanzely, Filip, & Richtarik, Peter. 2020. A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent. Pages 680 690 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 108. JMLR. (pages 21, 37) Gower, Robert Mansel, Loizou, Nicolas, Qian, Xun, Sailanbayev, Alibek, Shulgin, Egor, & Richtรกrik, Peter. 2019. SGD: General Analysis and Improved Rates. Pages 5200 5209 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (pages 4, 8, 21, 38) Haochen, Jeff, & Sra, Suvrit. 2019. 
Random Shuffling Beats SGD after Finite Epochs. Pages 2624 2633 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (page 10) Hernandez-Lobato, Jose, Li, Yingzhen, Rowland, Mark, Bui, Thang, Hernandez-Lobato, Daniel, & Turner, Richard. 2016. Black-Box Alpha Divergence Minimization. Pages 1511 1520 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 48. JMLR. (page 2) Hoffman, Matthew, & Ma, Yian. 2020. Black-Box Variational Inference as a Parametric Approximation to Langevin Dynamics. Pages 4324 4341 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 119. JMLR. (pages 1, 10) Jager, F., Taddei, A., Moody, G. B., Emdin, M., Antoliฤ, G., Dorn, R., Smrdel, A., Marchesi, C., & Mark, R. G. 2003. Long-Term ST Database: A Reference for the Development and Evaluation of Automated Ischaemia Detectors and for the Study of the Dynamics of Myocardial Ischaemia. Medical and Biological Engineering and Computing, 41(2), 172 182. (pages 40, 41) Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., & Saul, Lawrence K. 1999. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2), 183 233. (pages 2, 3) Karimi, Hamed, Nutini, Julie, & Schmidt, Mark. 2016. Linear Convergence of Gradient and Proximal-Gradient Methods under the Polyak-ลojasiewicz Condition. Pages 795 811 of: Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. Cham: Springer International Publishing. (page 7) Kawala, Franรงois, Douzal-Chouakria, Ahlame, Gaussier, Eric, & Diemert, Eustache. 2013. Prรฉdictions d activitรฉ Dans Les Rรฉseaux Sociaux En Ligne. Page 16 of: Actes de La Confรฉrence Sur Les Modรจles et L Analyse Des Rรฉseaux : Approches Mathรฉmatiques et Informatique. (page 40) Khaled, Ahmed, & Richtรกrik, Peter. 2023. Better Theory for SGD in the Nonconvex World. Transactions of Machine Learning Research. (pages 7, 21, 34, 35) Khan, Mohammad, & Lin, Wu. 2017. Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. Pages 878 887 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 54. JMLR. (page 10) Khan, Mohammad Emtiyaz, Babanezhad, Reza, Lin, Wu, Schmidt, Mark, & Sugiyama, Masashi. 2016. Faster Stochastic Variational Inference Using Proximal-Gradient Methods with General Divergence Functions. Pages 319 328 of: Proceedings of the Conference on Uncertainty in Artificial Intelligence. UAI 16. Arlington, Virginia, USA: AUAI Press. (pages 7, 10) Khan, Mohammad Emtiyaz E, Baque, Pierre, Fleuret, Franรงois, & Fua, Pascal. 2015. Kullback Leibler Proximal Variational Inference. In: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (pages 7, 10) Kim, Kyurae, Oh, Jisu, Gardner, Jacob, Dieng, Adji Bousso, & Kim, Hongseok. 2022. Markov Chain Score Ascent: A Unifying Framework of Variational Inference with Markovian Gradients. Pages 34802 34816 of: Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc. (page 2) Kim, Kyurae, Wu, Kaiwen, Oh, Jisu, & Gardner, Jacob R. 2023. Practical and Matching Gradient Variance Bounds for Black-Box Variational Bayesian Inference. Pages 16853 16876 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 202. Honolulu, HI, USA: JMLR. (pages 1, 3, 4, 7, 8, 10, 21, 22, 34, 35, 36) Kingma, Diederik P., & Ba, Jimmy. 2015. 
Adam: A Method for Stochastic Optimization. In: Proceedings of the International Conference on Learning Representations. (pages 2, 9, 20) Kingma, Diederik P., & Welling, Max. 2014 (Apr.). Auto-Encoding Variational Bayes. In: Proceedings of the International Conference on Learning Representations. (page 2) Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, & Blei, David M. 2017. Automatic Differentiation Variational Inference. Journal of Machine Learning Research, 18(14), 1 45. (pages 1, 3, 7, 21) Kunstner, Frederik, Chen, Jacques, Lavington, Jonathan Wilder, & Schmidt, Mark. 2023 (Feb.). Noise Is Not the Main Factor behind the Gap between Sgd and Adam on Transformers, but Sign Descent Might Be. In: Proceedings of the International Conference on Learning Representations. (page 9) Lambert, Marc, Chewi, Sinho, Bach, Francis, Bonnabel, Silvรจre, & Rigollet, Philippe. 2022. Variational Inference via Wasserstein Gradient Flows. Pages 14434 14447 of: Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc. (page 10) Leger, Jean-Benoist. 2023 (Jan.). Parametrization Cookbook: A Set of Bijective Parametrizations for Using Machine Learning Methods in Statistical Inference. ar Xiv Preprint ar Xiv:2301.08297. ar Xiv. (page 3) Lin, Wu, Khan, Mohammad Emtiyaz, & Schmidt, Mark. 2019. Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-Family Approximations. Pages 3992 4002 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (page 10) Liu, Sifan, & Owen, Art B. 2021. Quasi-Monte Carlo Quasi-Newton in Variational Bayes. Journal of Machine Learning Research, 22(243), 1 23. (pages 4, 10) Magnusson, Mรฅns, Bรผrkner, Paul, & Vehtari, Aki. 2022 (Nov.). Posteriordb: A Set of Posteriors for Bayesian Inference and Probabilistic Programming. (page 40) Mishchenko, Konstantin, Khaled, Ahmed, & Richtarik, Peter. 2020. Random Reshuffling: Simple Analysis with Vast Improvements. Pages 17309 17320 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 10) Naesseth, Christian, Lindsten, Fredrik, & Blei, David. 2020. Markovian Score Climbing: Variational Inference with KL(p||q). Pages 15499 15510 of: Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc. (page 2) Nagaraj, Dheeraj, Jain, Prateek, & Netrapalli, Praneeth. 2019. SGD without Replacement: Sharper Rates for General Smooth Convex Functions. Pages 4703 4711 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 97. JMLR. (page 10) Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. 2009. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM Journal on Optimization, 19(4), 1574 1609. (page 2) Nguyen, Lam, Nguyen, Phuong Ha, van Dijk, Marten, Richtarik, Peter, Scheinberg, Katya, & Takac, Martin. 2018. SGD and Hogwild! Convergence without the Bounded Gradients Assumption. Pages 3750 3758 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 80. JMLR. (page 4) Patil, Anand, Huard, David, & Fonnesbeck, Christopher. 2010. Py MC: Bayesian Stochastic Modelling in Python. Journal of Statistical Software, 35(4). (page 1) Ranganath, Rajesh, Gerrish, Sean, & Blei, David. 2014. Black Box Variational Inference. Pages 814 822 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 33. JMLR. (page 1) Reddi, Sashank J., Kale, Satyen, & Kumar, Sanjiv. 2023 (May). 
On the Convergence of Adam and Beyond. In: Proceedings of the International Conference on Learning Representations. (page 9) Regier, Jeffrey, Jordan, Michael I, & Mc Auliffe, Jon. 2017. Fast Black-Box Variational Inference through Stochastic Trust-Region Optimization. Pages 2399 2408 of: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (page 10) Robbins, Herbert, & Monro, Sutton. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400 407. (page 2) Roeder, Geoffrey, Wu, Yuhuai, & Duvenaud, David K. 2017. Sticking the Landing: Simple, Lower Variance Gradient Estimators for Variational Inference. Pages 6928 6937 of: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (pages 2, 21) Shannon, Paul, Markiel, Andrew, Ozier, Owen, Baliga, Nitin S., Wang, Jonathan T., Ramage, Daniel, Amin, Nada, Schwikowski, Benno, & Ideker, Trey. 2003. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Research, 13(11), 2498 2504. (page 40) Titsias, Michalis. 2009. Variational Learning of Inducing Variables in Sparse Gaussian Processes. Pages 567 574 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 5. JMLR. (page 4) Titsias, Michalis, & Lรกzaro-Gredilla, Miguel. 2014. Doubly Stochastic Variational Bayes for Non Conjugate Inference. Pages 1971 1979 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 32. JMLR. (pages 1, 2, 3, 5, 6, 10, 21) van der Vaart, A. W. 1998. Asymptotic Statistics. First edn. Cambridge University Press. (page 7) Vaswani, Sharan, Bach, Francis, & Schmidt, Mark. 2019. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron. Pages 1195 1204 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 89. JMLR. (page 4) Wang, Yixin, & Blei, David. 2019. Variational Bayes under Model Misspecification. In: Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (page 7) Welandawe, Manushi, Andersen, Michael Riis, Vehtari, Aki, & Huggins, Jonathan H. 2022 (Mar.). Robust, Automated, and Accurate Black-Box Variational Inference. ar Xiv Preprint ar Xiv:2203.15945. ar Xiv. (page 1) Wright, Stephen J., & Recht, Benjamin. 2021. Optimization for Data Analysis. New York: Cambridge University Press. (page 21) Xu, Ming, Quiroz, Matias, Kohn, Robert, & Sisson, Scott A. 2019. Variance Reduction Properties of the Reparameterization Trick. Pages 2711 2720 of: Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, vol. 89. JMLR. (page 10) Xu, Zuheng, & Campbell, Trevor. 2022. The Computational Asymptotics of Gaussian Variational Inference and the Laplace Approximation. Statistics and Computing, 32(4), 63. (page 10) Yao, Yuling, Vehtari, Aki, Simpson, Daniel, & Gelman, Andrew. 2018. Yes, but Did It Work?: Evaluating Variational Inference. Pages 5581 5590 of: Proceedings of the International Conference on Machine Learning. PMLR, vol. 80. JMLR. (page 1) Yun, Jihun, Lozano, Aurelie C, & Yang, Eunho. 2021. Adaptive Proximal Gradient Methods for Structured Neural Networks. Pages 24365 24378 of: Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc. (pages 2, 9, 20) Zhang, Cheng, Butepage, Judith, Kjellstrom, Hedvig, & Mandt, Stephan. 2019. Advances in Variational Inference. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 2008–2026. (pages 2, 9) Zhang, Yushun, Chen, Congliang, Shi, Naichen, Sun, Ruoyu, & Luo, Zhi-Quan. 2022. Adam Can Converge without Any Modification on Update Rules. Pages 28386–28399 of: Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc. (page 9)

On the Convergence of Black-Box Variational Inference: Appendix

Table of Contents
1 Introduction
2 Background
  2.1 Black-Box Variational Inference
  2.2 Variational Family
  2.3 Scale Parameterizations
  2.4 Problem Structure of Black-Box Variational Inference
3 The Evidence Lower Bound Under Nonlinear Scale Parameterizations
  3.1 Technical Assumptions
  3.2 Smoothness of the Entropy
  3.3 Smoothness of the Energy
  3.4 Convexity of the Energy
4 Convergence Analysis of Black-Box Variational Inference
  4.1 Black-Box Variational Inference
  4.2 Black-Box Variational Inference with Proximal SGD
5 Experiments
  5.1 Synthetic Problem
  5.2 Realistic Problems
6 Discussions
A Computational Resources
B Nomenclature
C Definitions
D ProxGen-Adam for Black-Box Variational Inference
E Detailed Comparison Against Domke et al. (2023)
F Proofs
  F.1 Auxiliary Lemmas
  F.2 Properties of the Evidence Lower Bound
    F.2.1 Smoothness
    F.2.2 Convexity
  F.3 Convergence of Black-Box Variational Inference
    F.3.1 Vanilla Black-Box Variational Inference
    F.3.2 Proximal Black-Box Variational Inference
G Details of Experimental Setup
H Additional Experimental Results

A Computational Resources

Table 1: Computational Resources
System topology: 2 nodes with 2 sockets, each with 24 logical threads (48 threads in total)
Processor: 1x Intel Xeon Silver 4310, 2.1 GHz (maximum 3.3 GHz) per socket
Cache: 1.1 MiB L1, 30 MiB L2, and 36 MiB L3
Memory: 250 GiB RAM
Accelerator: 1x NVIDIA RTX A5000 per node, 2 GHz, 24 GB RAM

Running the experiments took approximately a week.
B Nomenclature

Symbol (definition): description [section]
$\lambda$: variational parameters [2.1]
$z$: parameters of the target model $\pi$ [2.1]
$\mathcal{T}_{\lambda} \triangleq C(\lambda)\, u + m$: location-scale reparameterization function [2.2]
$u$: random vector before reparameterization [2.2]
$\varphi$: base distribution of $u$ [2.2]
$m$: location parameter (part of $\lambda$) [2.2]
$C$: scale matrix (part of $\lambda$) [2.2 and 2.3]
$\phi$: diagonal conditioner [2.3]
$k_{\varphi}$: kurtosis (non-central 4th moment) of $u$ [2.3]
$D_{\phi}(s)$: diagonal of $C$ formed through the diagonal conditioner $\phi$ [2.3]
$L$: strictly lower triangular part of $C$ [2.3]
$s$: elements forming the diagonal of $C$ [2.3]
$\ell(z) \triangleq -\log p(z, x)$: negative joint likelihood [2.4]
$f(\lambda) \triangleq \mathbb{E}_{z \sim q_{\lambda}}\, \ell(z)$: energy [2.4]
$h(\lambda) \triangleq -\mathbb{H}(q_{\lambda})$: negative entropy [2.4]
$F(\lambda) \triangleq f(\lambda) + h(\lambda)$: negative ELBO [2.1 and 2.4]
$f(\lambda; u) \triangleq \ell(\mathcal{T}_{\lambda}(u))$: negative log-likelihood under reparameterization [2.4]
$g(\lambda; u) \triangleq \nabla \ell(\mathcal{T}_{\lambda}(u))$: gradient of the log-likelihood under reparameterization [2.4]
$M$: number of Monte Carlo samples [2.1]
$\gamma_t$: stepsize of (proximal) SGD at iteration $t$ [4.1 and 4.2]
$\widehat{\nabla f}$: reparameterization gradient estimator of the energy [4.1]

C Definitions

For completeness, we provide formal definitions for some of the terms used throughout the paper.

Definition 5 (Smoothness). A function $f: \mathcal{Z} \to \mathbb{R}$ is said to be $L$-smooth if the inequality $\|\nabla f(z) - \nabla f(z')\|_2 \leq L\,\|z - z'\|_2$ holds for all $z, z' \in \mathcal{Z}$. This assumption, also occasionally called Lipschitz smoothness, restricts how much the gradient can change over a given distance. When $f$ is twice differentiable, an equivalent condition is that the Hessian is bounded:

Definition 6 (Smoothness). A twice differentiable function $f: \mathcal{Z} \to \mathbb{R}$ is said to be $L$-smooth if the inequality $\|\nabla^2 f(z)\|_2 \leq L$ holds for all $z \in \mathcal{Z}$.

Remark 7. Assuming a function $f$ is smooth is equivalent to assuming that $f$ can be upper bounded by a quadratic function everywhere.

Remark 8. When the log-density $\log \pi$ of a probability measure $\Pi$ is $L$-smooth, $\log \pi$ can be lower bounded everywhere by the log-density of a Gaussian (up to an additive constant).

Definition 7 (Strong Convexity). A twice differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be $\mu$-strongly convex if the inequality $f(z') \geq f(z) + \langle \nabla f(z),\, z' - z\rangle + \frac{\mu}{2}\|z' - z\|_2^2$ holds for all $z, z' \in \mathbb{R}^d$ and some $\mu > 0$.

Remark 9. If Definition 7 holds only for $\mu = 0$, $f$ is said to be (non-strongly) convex.

Remark 10. Assuming a function $f$ is strongly convex is equivalent to assuming that $f$ can be lower bounded by a quadratic everywhere.

Definition 8 (Strongly Log-Concave Measures). For a probability measure $\Pi$ on the Euclidean measurable space $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d), \mathbb{P})$, where $\mathcal{B}(\mathbb{R}^d)$ is the $\sigma$-algebra of Borel-measurable subsets of $\mathbb{R}^d$ and $\mathbb{P}$ is the Lebesgue measure, we say $\Pi$ is $\mu$-strongly log-concave if its negative log-density $-\log \pi(z): \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex for some $\mu > 0$.

Remark 11. If Definition 8 holds only for $\mu = 0$, $\Pi$ is said to be (non-strongly) log-concave.

Remark 12. When $\Pi$ is $\mu$-strongly log-concave, $\log \pi$ can be upper bounded everywhere by the log-density of a Gaussian (up to an additive constant).
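To make the nomenclature above concrete, the following is a minimal sketch of the location-scale reparameterization $\mathcal{T}_{\lambda}(u) = C(\lambda)\,u + m$ with the scale matrix $C(\lambda) = D_{\phi}(s) + L$ built from either the linear (identity) or a nonlinear (softplus) diagonal conditioner. The function and variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def softplus(x):
    # Nonlinear diagonal conditioner phi(x) = log(1 + exp(x)).
    return np.log1p(np.exp(x))

def scale_matrix(s, L, conditioner):
    # C(lambda) = D_phi(s) + L: conditioned diagonal plus the strictly
    # lower-triangular part (Cholesky family; L = 0 gives the mean-field family).
    return np.diag(conditioner(s)) + np.tril(L, k=-1)

def reparameterize(m, s, L, u, conditioner=softplus):
    # Location-scale reparameterization T_lambda(u) = C(lambda) u + m.
    return scale_matrix(s, L, conditioner) @ u + m

# Example: a 3-dimensional location-scale variational approximation.
d = 3
m, s, L = np.zeros(d), np.zeros(d), np.zeros((d, d))
u = np.random.default_rng(0).standard_normal(d)
z_nonlinear = reparameterize(m, s, L, u)                        # softplus diagonal
z_linear = reparameterize(m, s, L, u, conditioner=lambda x: x)  # identity diagonal
```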
D ProxGen-Adam for Black-Box Variational Inference

Algorithm 1: ProxGen-Adam for Black-Box Variational Inference
Input: initial variational parameters $\lambda_0$, base stepsize $\alpha$, second-moment stepsize $\beta_2$, momentum stepsizes $\{\beta_{1,t}\}_{t=1}^{T}$, small positive constant $\epsilon$
for $t = 1, \ldots, T$ do
    estimate the gradient of the energy, $g_t = \widehat{\nabla f}(\lambda_t)$
    $\bar{g}_{t+1} = \beta_{1,t}\, \bar{g}_t + (1 - \beta_{1,t})\, g_t$
    $v_{t+1} = \beta_2\, v_t + (1 - \beta_2)\, g_t^2$
    $\Gamma_{t+1} = \mathrm{diag}\big(\alpha / (\sqrt{v_{t+1}} + \epsilon)\big)$
    $\lambda_{t+1} = \lambda_t - \Gamma_{t+1}\, \bar{g}_{t+1}$
    $s_{t+1} \leftarrow \mathrm{getscale}(\lambda_{t+1})$
    $s_{t+1} \leftarrow s_{t+1} + \frac{1}{2}\big(\sqrt{s_{t+1}^2 + 4\gamma_{s,t+1}} - s_{t+1}\big)$
    $\lambda_{t+1} \leftarrow \mathrm{setscale}(\lambda_{t+1}, s_{t+1})$
end
(By convention, all vector operations are elementwise.)

Adaptive, matrix-valued stepsize variants of SGD such as Adam (Kingma & Ba, 2015) and AdaGrad (Duchi et al., 2011) are widely used. The matrix stepsize of Adam at iteration $t$ is given as $\Gamma_{t+1} = \mathrm{diag}\big(\alpha / (\sqrt{v_{t+1}} + \epsilon)\big)$, where $v_t$ is the exponential moving average of the second moment and $\alpha$ is the base stepsize. Furthermore, the matrix stepsize is applied to a moving average of the gradients, a scheme often called (heavy-ball) momentum, denoted here as $\bar{g}_t$. Recently, Yun et al. (2021) have proven the convergence of these adaptive, momentum-based, matrix-valued-stepsize SGD methods with proximal steps. The proximal operator is applied as
$\mathrm{prox}_{\Gamma_t, h}(\lambda_t - \Gamma_t \bar{g}_t) = \arg\min_{\lambda} \big\{ \langle \bar{g}_t, \lambda\rangle + h(\lambda) + \tfrac{1}{2}(\lambda - \lambda_t)^{\top} \Gamma_t^{-1} (\lambda - \lambda_t) \big\}.$
For Adam, the matrix-valued stepsize is a diagonal matrix. Thus, the proximal operator of Domke (2020) decomposes, for each $s_i$, into independent one-dimensional quadratic problems, and the proximal step is given in closed form as
$\mathrm{prox}_{\Gamma_t, h}(s_i) = s_i + \tfrac{1}{2}\big(\sqrt{s_i^2 + 4\gamma_{s_i}} - s_i\big),$
where, dropping the index $t$ for clarity, $s_i$ is the element of the (pre-proximal) iterate corresponding to the $i$th diagonal scale and $\gamma_{s_i}$ denotes the stepsize of $s_i$ (a diagonal element of $\Gamma_t$). Combined with the Adam-like stepsize rule, the resulting procedure is shown in Algorithm 1.

Difference with Adam. Algorithm 1 differs from vanilla Adam in a few ways. Notably, ProxGen-Adam does not perform bias correction of the estimated moments. Furthermore, while some implementations of Adam decay $\beta_1$, we keep it constant. It is possible that these differences result in behavior different from vanilla Adam. However, in this work, we follow the original implementation by Yun et al. (2021) as closely as possible and leave the comparison with vanilla Adam to future work.
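For concreteness, the following is a minimal NumPy sketch of a single ProxGen-Adam iteration in the spirit of Algorithm 1, assuming a mean-field family with the linear scale parameterization so that the closed-form proximal step of Domke (2020) applies elementwise to the scale parameters. The parameter packing, state layout, and default constants are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def proxgen_adam_step(m, s, grad_m, grad_s, state, alpha=1e-2,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One ProxGen-Adam update for mean-field BBVI (linear parameterization).

    (m, s): location and diagonal scale; (grad_m, grad_s): an unbiased estimate
    of the energy gradient at (m, s). The negative entropy -sum(log s) is
    handled by the proximal step rather than through its gradient.
    """
    g = np.concatenate([grad_m, grad_s])
    state["g_bar"] = beta1 * state["g_bar"] + (1 - beta1) * g   # momentum
    state["v"] = beta2 * state["v"] + (1 - beta2) * g**2        # second moment
    gamma = alpha / (np.sqrt(state["v"]) + eps)                 # diagonal stepsize
    lam = np.concatenate([m, s]) - gamma * state["g_bar"]       # gradient step
    m_new, s_half = lam[: m.size], lam[m.size:]
    gamma_s = gamma[m.size:]
    # Closed-form prox of -log(s_i) with stepsize gamma_{s_i} (Domke, 2020):
    s_new = 0.5 * (s_half + np.sqrt(s_half**2 + 4 * gamma_s))
    return m_new, s_new, state

# Usage: state = {"g_bar": np.zeros(2 * d), "v": np.zeros(2 * d)}
```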
E Detailed Comparison Against Domke et al. (2023)

In this section, we contrast our results against those of Domke et al. (2023). First, the main challenge in establishing a convergence guarantee for BBVI has been bounding the gradient variance. In particular, Domke (2019) proved that the variance of the reparameterization gradient of the energy, $\widehat{\nabla f}$, is bounded as
$\mathbb{E}\,\big\|\widehat{\nabla f}(\lambda)\big\|_2^2 \leq \alpha\,\|\lambda - \lambda^*\|_2^2 + \beta \quad (3)$
for some finite positive constants $\alpha, \beta$ depending on the problem constants $d$, $L_{\ell}$, $k_{\varphi}$. Domke et al. (2023) call a gradient estimator satisfying this bound a quadratic variance estimator. Furthermore, they prove that the closed-form entropy (CFE; Kucukelbir et al., 2017; Titsias & Lázaro-Gredilla, 2014) estimator,
$\widehat{\nabla F}_{\mathrm{CFE}}(\lambda) \triangleq \widehat{\nabla f}(\lambda) + \nabla h(\lambda),$
and the STL estimator by Roeder et al. (2017),
$\widehat{\nabla F}_{\mathrm{STL}}(\lambda) \triangleq \frac{1}{M}\sum_{m=1}^{M} \nabla_{\lambda}\,\ell(\mathcal{T}_{\lambda}(u_m)) + \nabla_{\lambda} \log q_{\nu}(\mathcal{T}_{\lambda}(u_m))\,\big|_{\nu = \lambda}, \qquad u_m \sim \varphi,$
also qualify as quadratic variance estimators. Unfortunately, it has been unknown whether SGD is guaranteed to converge with a quadratic variance estimator except for strongly convex objectives (Wright & Recht, 2021, p. 85). Domke et al. (2023) expand the boundaries of SGD and prove that projected and proximal SGD with a quadratic variance estimator converge for both convex and strongly convex objectives. In particular, for the location-scale variational family, the linear parameterization, and log-concave objectives, they prove a complexity of $\mathcal{O}(1/\epsilon^2)$, and for strongly log-concave objectives, they prove a complexity of $\mathcal{O}(1/\epsilon)$. On the other hand, Kim et al. (2023) and Lemma 12 developed the bound in Equation (3) into a bound of the form
$\mathbb{E}\,\big\|\widehat{\nabla F}(\lambda)\big\|_2^2 \leq 2A\,(F(\lambda) - F^*) + B\,\|\nabla F(\lambda)\|_2^2 + C \quad (4)$
for some positive finite constants $A, B, C$, for which the convergence of SGD for convex, strongly convex (Garrigos & Gower, 2023; Gorbunov et al., 2020), and non-convex objectives (Khaled & Richtárik, 2023) has already been proven. Applying these results to log-smooth and log-quadratically growing objectives, we prove a complexity of $\mathcal{O}(1/\epsilon^4)$, while for strongly log-concave objectives, we also prove a complexity of $\mathcal{O}(1/\epsilon)$. Overall, both approaches can be summarized as follows: we focused on establishing the gradient variance bounds required by known convergence proofs, while Domke et al. (2023) aimed to prove that the bound by Domke (2019) is sufficient to guarantee convergence. Note that, for strongly log-concave objectives, Equation (3) immediately implies Equation (4). Therefore, both approaches intersect in the case of strongly log-concave objectives. Indeed, Theorem 8 and the analogous result of Domke et al. (2023) are both based on the same proof strategy by Gower et al. (2019).

F Proofs

F.1 Auxiliary Lemmas

Lemma 5. Let $\phi(x) = x$. Then the parameterization is linear in the sense that $\mathcal{T}_{\lambda}$ is a bilinear function such that
$\mathcal{T}_{\lambda - \lambda'}(u) = \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)$
for any $\lambda, \lambda' \in \Lambda$.

Proof. We have
$\mathcal{T}_{\lambda - \lambda'}(u) = C(\lambda - \lambda')\,u + (m - m') = \big(D_{\phi}(s - s') + (L - L')\big)\,u + (m - m'),$
using the fact that $\phi$ is the identity function,
$= \big(D_{\phi}(s) - D_{\phi}(s') + (L - L')\big)\,u + (m - m') = \big(C(\lambda) - C(\lambda')\big)\,u + (m - m') = \big(C(\lambda)\,u + m\big) - \big(C(\lambda')\,u + m'\big) = \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u).$
The linearity with respect to $u$ is obvious.

Lemma 6. Let the linear parameterization be used. Then, for any $\lambda \in \Lambda$, the inner product of the Jacobian of the reparameterization function satisfies the following relationships for any $u \in \mathbb{R}^d$:
(i) For the Cholesky family (Domke, 2019, Lemma 8), $\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\big(\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\big)^{\top} \preceq (1 + \|u\|_2^2)\, I$.
(ii) For the mean-field family (Kim et al., 2023, Lemma 1), $\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\big(\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\big)^{\top} \preceq (1 + \|U\|_{\mathrm{F}}^2)\, I$, where $U = \mathrm{diag}(u_1, \ldots, u_d)$.

Lemma 7. Let the linear parameterization be used. Then, for any $\lambda \in \Lambda$ and any $z \in \mathbb{R}^d$, the following relationships hold:
(i) For the Cholesky family (Domke, 2019, Lemma 2),
$\mathbb{E}\,(1 + \|u\|_2^2)\,\|\mathcal{T}_{\lambda}(u) - z\|_2^2 = (d + 1)\,\|m - z\|_2^2 + (d + k_{\varphi})\,\|C\|_{\mathrm{F}}^2.$
(ii) For the mean-field family (Kim et al., 2023, Lemma 2),
$\mathbb{E}\,(1 + \|U\|_{\mathrm{F}}^2)\,\|\mathcal{T}_{\lambda}(u) - z\|_2^2 \leq \big(\sqrt{d\,k_{\varphi}} + k_{\varphi}\sqrt{d} + 1\big)\,\|m - z\|_2^2 + \big(2 k_{\varphi}\sqrt{d} + 1\big)\,\|C\|_{\mathrm{F}}^2.$

Corollary 2.
Let the linear parameterization be used and $\lambda, \lambda' \in \Lambda$ be any pair of variational parameters. Then:
(i) For the Cholesky family, $\mathbb{E}\,(1 + \|u\|_2^2)\,\|\mathcal{T}_{\lambda'}(u) - \mathcal{T}_{\lambda}(u)\|_2^2 \leq (k_{\varphi} + d)\,\|\lambda - \lambda'\|_2^2$.
(ii) For the mean-field family, $\mathbb{E}\,(1 + \|U\|_{\mathrm{F}}^2)\,\|\mathcal{T}_{\lambda'}(u) - \mathcal{T}_{\lambda}(u)\|_2^2 \leq (2 k_{\varphi}\sqrt{d} + 1)\,\|\lambda - \lambda'\|_2^2$.

Proof. The results are a direct consequence of Lemma 7 and Lemma 5.

Proof of (i). We start from Lemma 7 as
$\mathbb{E}\,(1 + \|u\|_2^2)\,\|\mathcal{T}_{\lambda - \lambda'}(u) - z\|_2^2 = (d + 1)\,\|(m - m') - z\|_2^2 + (d + k_{\varphi})\,\|C(\lambda) - C(\lambda')\|_{\mathrm{F}}^2;$
setting $z = \mathbf{0}$,
$= (d + 1)\,\|m - m'\|_2^2 + (d + k_{\varphi})\,\|C(\lambda) - C(\lambda')\|_{\mathrm{F}}^2,$
and since $k_{\varphi} \geq 3$ by the property of the kurtosis,
$\leq (d + k_{\varphi})\big(\|m - m'\|_2^2 + \|C(\lambda) - C(\lambda')\|_{\mathrm{F}}^2\big) = (d + k_{\varphi})\,\|\lambda - \lambda'\|_2^2.$

Proof of (ii). Similarly, for the mean-field family, we can apply Lemma 7 as
$\mathbb{E}\,(1 + \|U\|_{\mathrm{F}}^2)\,\|\mathcal{T}_{\lambda - \lambda'}(u) - z\|_2^2 \leq \big(\sqrt{d k_{\varphi}} + k_{\varphi}\sqrt{d} + 1\big)\,\|(m - m') - z\|_2^2 + \big(2 k_{\varphi}\sqrt{d} + 1\big)\,\|C(\lambda) - C(\lambda')\|_{\mathrm{F}}^2;$
setting $z = \mathbf{0}$,
$= \big(\sqrt{d k_{\varphi}} + k_{\varphi}\sqrt{d} + 1\big)\,\|m - m'\|_2^2 + \big(2 k_{\varphi}\sqrt{d} + 1\big)\,\|C(\lambda) - C(\lambda')\|_{\mathrm{F}}^2,$
and since $k_{\varphi} \geq 3$ by the property of the kurtosis,
$\leq \big(2 k_{\varphi}\sqrt{d} + 1\big)\big(\|m - m'\|_2^2 + \|C(\lambda) - C(\lambda')\|_{\mathrm{F}}^2\big) = \big(2 k_{\varphi}\sqrt{d} + 1\big)\,\|\lambda - \lambda'\|_2^2.$

Lemma 8. For the linear parameterization, $\mathbb{E}\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2 = \|\lambda - \lambda'\|_2^2$ for any $\lambda, \lambda' \in \Lambda$.

Proof. First notice that, for linear parameterizations, we have
$\mathbb{E}\,\|\mathcal{T}_{\lambda}(u)\|_2^2 = \mathbb{E}\,\|C u + m\|_2^2 = \mathbb{E}\,[u^{\top} C^{\top} C u] + \|m\|_2^2 + 2 m^{\top} C\, \mathbb{E}\,u = \mathbb{E}\,\mathrm{tr}(u^{\top} C^{\top} C u) + \|m\|_2^2 + 2 m^{\top} C\, \mathbb{E}\,u;$
rotating the elements of the trace,
$= \mathrm{tr}\big(C^{\top} C\, \mathbb{E}[u u^{\top}]\big) + \|m\|_2^2 + 2 m^{\top} C\, \mathbb{E}\,u;$
applying Assumption 1,
$= \mathrm{tr}(C^{\top} C) + \|m\|_2^2 = \|C\|_{\mathrm{F}}^2 + \|m\|_2^2 = \|\lambda\|_2^2.$
Combined with Lemma 5, we have
$\mathbb{E}\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2 = \mathbb{E}\,\|\mathcal{T}_{\lambda - \lambda'}(u)\|_2^2 = \|\lambda - \lambda'\|_2^2.$

F.2 Properties of the Evidence Lower Bound

F.2.1 Smoothness

Lemma 1. If the diagonal conditioner $\phi$ is $L_h$-log-smooth, then the entropic regularizer $h(\lambda)$ is $L_h$-smooth.

Proof. The entropic regularizer is
$h(\lambda) = -\mathbb{H}(\varphi) - \sum_{i=1}^{d} \log \phi(s_i),$
and depends only on the diagonal elements $s_1, \ldots, s_d$ of $C$. The Hessian of $h$ is then a diagonal matrix where only the entries corresponding to $s_1, \ldots, s_d$ are non-zero. The Lipschitz smoothness constant is then the constant $L_h < \infty$ that satisfies
$\Big|\frac{\mathrm{d}^2}{\mathrm{d} s_i^2} \log \phi(s_i)\Big| \leq L_h$
for all $i = 1, \ldots, d$, which is the smoothness constant of $s_i \mapsto -\log \phi(s_i)$.

Lemma 2. Let $H$ be an $m \times m$ symmetric random matrix, bounded as $\|H\|_2 \leq L < \infty$ almost surely. Also, let $J$ be an $m \times n$ random matrix such that $\|\mathbb{E}\,J^{\top} J\|_2 < \infty$. Then, $\|\mathbb{E}\,J^{\top} H J\|_2 \leq L\,\|\mathbb{E}\,J^{\top} J\|_2$.

Proof. By the property of Rayleigh quotients, for a symmetric matrix $A$, its maximum eigenvalue is given in the variational form
$\sup_{\|x\|_2 \leq 1} x^{\top} A x = \sigma_{\max}(A) \leq |\sigma_{\max}(A)| \leq \|A\|_2,$
where $\sigma_{\max}(A)$ is the maximal eigenvalue of $A$. Notice the relationship with the $\ell_2$-operator norm; the inequality is strict only if all eigenvalues are negative. From the property above,
$\|\mathbb{E}\,J^{\top} H J\|_2 = \sup_{\|x\|_2 \leq 1} x^{\top}\big(\mathbb{E}\,J^{\top} H J\big)\, x.$
By reparameterizing as $y = J x$,
$= \sup_{\|x\|_2 \leq 1} \mathbb{E}\,[y^{\top} H y],$
and by the property of the $\ell_2$-operator norm,
$\leq \sup_{\|x\|_2 \leq 1} \mathbb{E}\,\|H\|_2\, \|y\|_2^2 = \sup_{\|x\|_2 \leq 1} \mathbb{E}\,\|H\|_2\, \|J x\|_2^2.$
From our assumption on the maximal eigenvalue of $H$,
$\leq L\,\sup_{\|x\|_2 \leq 1} \mathbb{E}\,\|J x\|_2^2;$
denoting the $\ell_2$ vector norm as a quadratic form,
$= L\,\sup_{\|x\|_2 \leq 1} x^{\top}\big(\mathbb{E}\,J^{\top} J\big)\,x;$
again, by the property of the $\ell_2$-operator norm,
$\leq L\,\|\mathbb{E}\,J^{\top} J\|_2\, \sup_{\|x\|_2 \leq 1} \|x\|_2^2 = L\,\|\mathbb{E}\,J^{\top} J\|_2.$

Lemma 9. For a 1-Lipschitz diagonal conditioner $\phi$, the Jacobian of the location-scale reparameterization function $\mathcal{T}_{\lambda}$ satisfies
$\Big\|\,\mathbb{E}\Big[\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top}\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\Big]\Big\|_2 \leq 1.$

Proof. For notational clarity, we will occasionally represent $\mathcal{T}_{\lambda}$ as $\mathcal{T}_{\lambda}(u) = \mathcal{T}(\lambda; u)$, such that $\mathcal{T}_i(\lambda; u)$ denotes the $i$th component of $\mathcal{T}_{\lambda}$. From the definition of $\mathcal{T}_{\lambda}$, it is straightforward to see that its Jacobian is the concatenation of three block matrices,
$J_m = \frac{\partial \mathcal{T}_{\lambda}(u)}{\partial m}, \qquad J_s = \frac{\partial \mathcal{T}_{\lambda}(u)}{\partial s}, \qquad J_L = \frac{\partial \mathcal{T}_{\lambda}(u)}{\partial L}.$
The $m$-block forms a deterministic identity matrix, as shown by Domke (2020, Lemma 4). The proof strategy is as follows: we directly compute the squared Jacobian through block matrix multiplication. The key observation is that, after taking the expectation, the resulting matrix becomes diagonal. The $\ell_2$-operator norm, or maximal eigenvalue, then follows trivially as the maximal diagonal element:
$\mathbb{E}\Big[\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top}\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\Big] = \mathbb{E}\begin{bmatrix} J_m^{\top} \\ J_s^{\top} \\ J_L^{\top}\end{bmatrix}\begin{bmatrix} J_m & J_s & J_L\end{bmatrix} = \mathbb{E}\begin{bmatrix} J_m^{\top} J_m & J_m^{\top} J_s & J_m^{\top} J_L \\ J_s^{\top} J_m & J_s^{\top} J_s & J_s^{\top} J_L \\ J_L^{\top} J_m & J_L^{\top} J_s & J_L^{\top} J_L\end{bmatrix} = \begin{bmatrix} I & \mathbb{E} J_s & \mathbb{E} J_L \\ \mathbb{E} J_s^{\top} & \mathbb{E} J_s^{\top} J_s & \mathbb{E} J_s^{\top} J_L \\ \mathbb{E} J_L^{\top} & \mathbb{E} J_L^{\top} J_s & \mathbb{E} J_L^{\top} J_L\end{bmatrix}.$
For $J_s$, the entries are $\frac{\partial \mathcal{T}_i(\lambda; u)}{\partial s_j} = \phi'(s_i)\, u_j\, \mathbb{1}_{i = j}$, which form a diagonal matrix. Thus, by Assumption 1, $\mathbb{E} J_s = \mathbf{0}$ and $\mathbb{E} J_s^{\top} J_s = \mathrm{diag}(\phi'(s))^2$.
For $J_L$, the entries are $\frac{\partial \mathcal{T}_i(\lambda; u)}{\partial L_{jk}} = u_k\, \mathbb{1}_{i = j}$. To gather some intuition, the pattern of the $u$'s entering each row in the case of $d = 4$ looks like the following:
$\begin{bmatrix} \cdot & u_2 & u_3 & u_4 \\ u_1 & \cdot & u_3 & u_4 \\ u_1 & u_2 & \cdot & u_4 \\ u_1 & u_2 & u_3 & \cdot \end{bmatrix}.$
It is crucial to notice that the $i$th row does not include $u_i$. This means that the matrix $J_s^{\top} J_L$ has entries that are either $0$ or $\phi'(s_i)\, u_i u_j$ for $i \neq j$, for which $\mathbb{E}\,\phi'(s_i)\, u_i u_j = 0$ by Assumption 1. Therefore, $\mathbb{E}\,J_s^{\top} J_L = \mathbf{0}$. Finally, the elements of $J_L^{\top} J_L$ are
$\sum_{i=1}^{d} \mathbb{E}\big[u_k u_m\, \mathbb{1}_{i = j}\, \mathbb{1}_{i = l}\big] = \mathbb{1}_{j = l}\,\big(\mathbb{E}\,u_k u_m\big) = \mathbb{1}_{j = l}\,\mathbb{1}_{k = m},$
where the last equality follows from Assumption 1; this forms an identity matrix, $\mathbb{E}\,J_L^{\top} J_L = I$. Therefore, the expected squared Jacobian is now
$\mathbb{E}\Big[\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top}\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\Big] = \begin{bmatrix} I & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathrm{diag}(\phi'(s))^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & I\end{bmatrix},$
which, conveniently, is a diagonal matrix. The maximal singular value of a block-diagonal matrix is the maximum over the maximal singular values of its blocks, and since each block is diagonal with non-negative entries, the largest diagonal element gives the maximal singular value. As we assume that $\phi$ is 1-Lipschitz, every diagonal element is lower bounded by 0 and upper bounded by 1. Therefore, the maximal singular value of the expected squared Jacobian is bounded by 1.

Theorem 1. Let $\ell$ be $L_{\ell}$-smooth and twice differentiable. Then, the following results hold:
(i) If $\phi$ is linear, the energy $f$ is $L_{\ell}$-smooth.
(ii) If $\phi$ is 1-Lipschitz, the energy $f$ is $(L_{\ell} + L_s)$-smooth if and only if Assumption 3 holds.
Proof. For notational clarity, we will occasionally represent $\mathcal{T}_{\lambda}$ as $\mathcal{T}_{\lambda}(u) = \mathcal{T}(\lambda; u)$, such that $\mathcal{T}_i(\lambda; u)$ denotes the $i$th component of $\mathcal{T}_{\lambda}$. By the Leibniz rule and the chain rule, the Hessian of the energy $f$ follows as
$\nabla^2 f(\lambda) = \mathbb{E}\,\nabla_{\lambda}^2\, \ell(\mathcal{T}_{\lambda}(u)) = \underbrace{\mathbb{E}\Big[\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top}\, \nabla^2 \ell(\mathcal{T}_{\lambda}(u))\, \frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\Big]}_{T_{\mathrm{lin}}} + \underbrace{\mathbb{E}\Big[\sum_{i=1}^{d} \mathrm{D}_i \ell(\mathcal{T}_{\lambda}(u))\, \nabla_{\lambda}^2\, \mathcal{T}_i(\lambda; u)\Big]}_{T_{\mathrm{non}}}.$
When $\mathcal{T}$ is linear with respect to $\lambda$, it is clear that we have
$\nabla_{\lambda}^2\, \mathcal{T}_i(\lambda; u) = \mathbf{0}, \quad (5)$
and then $T_{\mathrm{non}}$ is zero. In contrast, $T_{\mathrm{lin}}$ appears in both the linear and nonlinear cases. Therefore, $T_{\mathrm{non}}$ fully characterizes the effect of nonlinearity in the reparameterization function. Now, the triangle inequality yields
$\|\nabla^2 f(\lambda)\|_2 = \|T_{\mathrm{lin}} + T_{\mathrm{non}}\|_2 \leq \|T_{\mathrm{lin}}\|_2 + \|T_{\mathrm{non}}\|_2,$
where equality is achieved when either term is 0. Conversely, the reverse triangle inequality states that
$\big|\,\|T_{\mathrm{lin}}\|_2 - \|T_{\mathrm{non}}\|_2\,\big| \leq \|\nabla^2 f(\lambda)\|_2.$
This implies that, if either $T_{\mathrm{lin}}$ or $T_{\mathrm{non}}$ is unbounded, the Hessian is not bounded. Thus, ensuring that $T_{\mathrm{lin}}$ and $T_{\mathrm{non}}$ are bounded is both sufficient and necessary to establish that $f$ is smooth.

Proof of (i). The bound on the linear part, $T_{\mathrm{lin}}$, follows from Lemma 2 as
$\|T_{\mathrm{lin}}\|_2 = \Big\|\mathbb{E}\Big[\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top}\, \nabla^2 \ell(\mathcal{T}_{\lambda}(u))\, \frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\Big]\Big\|_2 \leq L_{\ell}\,\Big\|\mathbb{E}\Big[\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top}\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\Big]\Big\|_2,$
and from the 1-Lipschitzness of $\phi$, Lemma 9 yields $\|T_{\mathrm{lin}}\|_2 \leq L_{\ell}$. When $\phi$ is linear, it immediately follows from Equation (5) that
$\|\nabla^2 f(\lambda)\|_2 = \|T_{\mathrm{lin}}\|_2 \leq L_{\ell},$
which is tight, as shown by Domke (2020, Theorem 6).

Proof of (ii). For the nonlinear part $T_{\mathrm{non}}$, we use the fact that $\mathcal{T}_i(\lambda; u)$ is given as
$\mathcal{T}_i(\lambda; u) = m_i + \phi(s_i)\, u_i + \sum_{j < i} L_{ij}\, u_j.$
The second derivative of $\mathcal{T}_i$ is clearly non-zero only for the nonlinear part involving $s_1, \ldots, s_d$. Thus, $T_{\mathrm{non}}$ follows as
$T_{\mathrm{non}} = \mathbb{E}\Big[\sum_{i=1}^{d} \mathrm{D}_i \ell(\mathcal{T}_{\lambda}(u))\, \nabla_{\lambda}^2\, \mathcal{T}_i(\lambda; u)\Big] = \mathbb{E}\Big[\sum_{i=1}^{d} g_i(\lambda; u)\, \nabla_{\lambda}^2\, \mathcal{T}_i(\lambda; u)\Big].$
Furthermore, the second-order derivatives with respect to $s_1, \ldots, s_d$ are given as
$\frac{\partial^2 \mathcal{T}_i(\lambda; u)}{\partial s_j^2} = \mathbb{1}_{i = j}\, \phi''(s_j)\, u_i.$
Considering this, the only non-zero block of $T_{\mathrm{non}}$ forms a diagonal matrix,
$\mathrm{diag}\big(\mathbb{E}\,[g_1(\lambda; u)\, \phi''(s_1)\, u_1], \ldots, \mathbb{E}\,[g_d(\lambda; u)\, \phi''(s_d)\, u_d]\big).$
This implies that the only non-zero entries of $T_{\mathrm{non}}$ lie on its diagonal. Since the $\ell_2$ norm of a diagonal matrix is the value of its maximal diagonal element (in absolute value),
$\|T_{\mathrm{non}}\|_2 \leq \max_{i = 1, \ldots, d} \big|\mathbb{E}\,[g_i(\lambda; u)\, \phi''(s_i)\, u_i]\big| \leq L_s,$
where $L_s$ is a finite constant if Assumption 3 holds. On the contrary, if a finite $L_s$ does not exist, $\|T_{\mathrm{non}}\|_2$ cannot be bounded. Therefore, the energy is smooth if and only if Assumption 3 holds. When it does, the energy $f$ is $(L_{\ell} + L_s)$-smooth.

Example 2. Let $\ell(z) = \frac{1}{2} z^{\top} A z$ and the diagonal conditioner be $\phi(x) = \mathrm{softplus}(x)$. Then, (i) if $A$ is dense and the variational family is the mean-field family, or (ii) if $A$ is diagonal and the variational family is the Cholesky family, Assumption 3 holds with $L_s \leq 0.26034\,(\max_{i = 1, \ldots, d} A_{ii})$. (iii) If $A$ is dense but the Cholesky family is used, Assumption 3 does not hold.

Proof.
Since the gradient is $\nabla \ell(z) = A z$, combined with reparameterization, we have $g(\lambda; u) = A\,(C u + m)$. Then, for each coordinate $i = 1, \ldots, d$, we have
$\mathbb{E}\,\big[g_i(\lambda; u)\, u_i\, \phi''(s_i)\big] = \mathbb{E}\Big[\Big(\sum_{k}\sum_{j} A_{ij} C_{jk}\, u_k + \sum_{j} A_{ij} m_j\Big)\, u_i\, \phi''(s_i)\Big] = \sum_{k}\sum_{j} A_{ij} C_{jk}\, \mathbb{E}[u_k u_i]\, \phi''(s_i) + \sum_{j} A_{ij} m_j\, \mathbb{E}[u_i]\, \phi''(s_i),$
and from Assumption 1,
$= \sum_{k}\sum_{j} A_{ij} C_{jk}\, \mathbb{1}_{k = i}\, \phi''(s_i) = \sum_{j} A_{ij} C_{ji}\, \phi''(s_i).$
Furthermore, the diagonal of $C$ involves $\phi$, such that
$\mathbb{E}\,\big[g_i(\lambda; u)\, u_i\, \phi''(s_i)\big] = \underbrace{A_{ii}\, \phi(s_i)\, \phi''(s_i)}_{T_{\mathrm{diag}}} + \underbrace{\sum_{j \neq i} A_{ij}\, C_{ji}\, \phi''(s_i)}_{T_{\mathrm{off}}}.$
For the softplus function, we have $0 < \phi''(s) < 1$ for any finite $s$, and
$\sup_{s}\, \phi(s)\, \phi''(s) \approx 0.26034,$
where the supremum was numerically approximated. It is then clear that $T_{\mathrm{diag}}$ is finite as long as the diagonal elements of $A$ are finite. Furthermore, we have the following: (i) If $A$ is diagonal, then $T_{\mathrm{off}}$ is 0. (ii) If $A$ is dense but $C$ is diagonal due to the use of the mean-field family, $T_{\mathrm{off}}$ is again 0. (iii) However, when both $A$ and $C$ are not diagonal, $T_{\mathrm{off}}$ can be made arbitrarily large.

F.2.2 Convexity

Lemma 10. Let $\ell$ be convex. Then, for a convex nonlinear $\phi$, the inequality
$\langle \nabla_{\lambda} f(\lambda),\, \lambda - \lambda'\rangle \geq \mathbb{E}\,\langle g(\lambda; u),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle$
holds for all $\lambda, \lambda' \in \Lambda$ if and only if Assumption 4 holds. For the linear parameterization, the inequality becomes an equality.

Proof. First, notice that the left-hand side is
$\langle \nabla_{\lambda} f(\lambda),\, \lambda - \lambda'\rangle = \big\langle \nabla_{\lambda}\, \mathbb{E}\,\ell(\mathcal{T}_{\lambda}(u)),\, \lambda - \lambda'\big\rangle = \mathbb{E}\,\Big\langle \Big(\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}\Big)^{\top} g(\lambda; u),\, \lambda - \lambda'\Big\rangle.$
By restricting ourselves to the location-scale family, we then get
$= \mathbb{E}\,\Big(\underbrace{\sum_{i} g_i(\lambda; u)\,(m_i - m_i')}_{\text{terms in } m} + \underbrace{\sum_{i}\sum_{j < i} u_j\, g_i(\lambda; u)\,(L_{ij} - L_{ij}')}_{\text{terms in } L} + \underbrace{\sum_{i} \phi'(s_i)\, u_i\, g_i(\lambda; u)\,(s_i - s_i')}_{\text{terms in } s}\Big).$
On the other hand, the right-hand side follows as
$\mathbb{E}\,\langle g(\lambda; u),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle = \mathbb{E}\,\big( \langle g(\lambda; u),\, m - m'\rangle + \langle g(\lambda; u),\, (L - L')\,u\rangle + \langle g(\lambda; u),\, (D_{\phi}(s) - D_{\phi}(s'))\,u\rangle \big)$
$= \mathbb{E}\,\Big(\sum_{i} g_i(\lambda; u)\,(m_i - m_i') + \sum_{i}\sum_{j < i} u_j\, g_i(\lambda; u)\,(L_{ij} - L_{ij}') + \sum_{i} g_i(\lambda; u)\, u_i\,\big(\phi(s_i) - \phi(s_i')\big)\Big).$
The terms with respect to $m$ and $L$ are clearly identical on both sides. The statement thus comes down to the last term. That is, the statement holds if
$\mathbb{E}\,\sum_{i} g_i(\lambda; u)\, u_i\, \phi'(s_i)\,(s_i - s_i') \geq \mathbb{E}\,\sum_{i} g_i(\lambda; u)\, u_i\,\big(\phi(s_i) - \phi(s_i')\big). \quad (6)$
For this, we will show that Assumption 4 is both necessary and sufficient.

Proof of sufficiency. Equation (6) holds if $\mathbb{E}\,[g_i(\lambda; u)\, u_i] \geq 0$ for all $i = 1, \ldots, d$, since the convexity of $\phi$ implies $\phi'(s_i)(s_i - s_i') \geq \phi(s_i) - \phi(s_i')$; this is none other than Assumption 4.
Proof of necessity. Suppose that the inequality
$\langle \nabla_{\lambda} f(\lambda),\, \lambda - \lambda'\rangle \geq \mathbb{E}\,\langle g(\lambda; u),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle$
holds for all $\lambda, \lambda' \in \Lambda$, implying
$\sum_{i} \mathbb{E}\,[u_i\, g_i(\lambda; u)]\, \phi'(s_i)\,(s_i - s_i') \geq \sum_{i} \mathbb{E}\,[u_i\, g_i(\lambda; u)]\,\big(\phi(s_i) - \phi(s_i')\big).$
For any $\lambda$, we are free to set any $\lambda' \in \Lambda$ and check whether we can retrieve Assumption 4 for this specific $\lambda$. Now, for each axis $i$, set $s_j' = s_j$ for all $j \neq i$; then
$\mathbb{E}\,[u_i\, g_i(\lambda; u)]\, \phi'(s_i)\,(s_i - s_i') \geq \mathbb{E}\,[u_i\, g_i(\lambda; u)]\,\big(\phi(s_i) - \phi(s_i')\big).$
Since $\phi$ is assumed to be convex, so that $\phi'(s_i)(s_i - s_i') \geq \phi(s_i) - \phi(s_i')$, it follows that
$\mathbb{E}\,[u_i\, g_i(\lambda; u)] \geq 0. \quad (7)$
Therefore, for any $\lambda \in \Lambda$, Assumption 4 must hold.

Proposition 1. We have the following: (i) If $\ell$ is convex, then for the mean-field family, Assumption 4 holds. (ii) For the Cholesky family, there exists a convex $\ell$ for which Assumption 4 does not hold.

Proof. For (i), the key property is the monotonicity of the gradient.

Proof of (i). For the mean-field family, recall that $C_{ii} = \phi(s_i)$. Also, observe that $C u + m = (C_{11} u_1 + m_1, \ldots, C_{dd} u_d + m_d)$. By the property of convex functions, $\nabla \ell$ is monotone, so that
$\langle \nabla \ell(z) - \nabla \ell(z'),\, z - z'\rangle \geq 0.$
Now, by setting $z = C u + m$ and $z' = C u + m - C_{ii} u_i\, \mathbf{e}_i$, we obtain
$\mathbb{E}\,\langle \nabla \ell(C u + m) - \nabla \ell(C u + m - C_{ii} u_i\, \mathbf{e}_i),\; C_{ii} u_i\, \mathbf{e}_i\rangle \geq 0$
for every $i = 1, \ldots, d$. For the mean-field family, $C u + m - C_{ii} u_i\, \mathbf{e}_i$ is independent of $u_i$. Thus,
$\mathbb{E}\,\big[C_{ii}\, u_i\, \mathrm{D}_i \ell(C u + m)\big] \geq \mathbb{E}\,\big[C_{ii}\, u_i\, \mathrm{D}_i \ell(C u + m - \mathbf{e}_i C_{ii} u_i)\big] = C_{ii}\,(\mathbb{E}\,u_i)\,\big(\mathbb{E}\,\mathrm{D}_i \ell(C u + m - \mathbf{e}_i C_{ii} u_i)\big) = 0,$
where $\mathrm{D}_i \ell$ denotes the partial derivative of $\ell$ along the $i$th axis. Since $C_{ii} > 0$ by design,
$\mathbb{E}\,\big[C_{ii}\, u_i\, \mathrm{D}_i \ell(C u + m)\big] \geq 0 \;\Longleftrightarrow\; \mathbb{E}\,\big[u_i\, \mathrm{D}_i \ell(C u + m)\big] \geq 0,$
which is Assumption 4.

Proof of (ii). We provide an example that proves the statement. Let $\ell(z) = \frac{1}{2} z^{\top} A z$. Then,
$g(\lambda; u) = \nabla \ell(\mathcal{T}_{\lambda}(u)) = A\,(C u + m) = A C u + A m.$
Suppose that we choose $\lambda$ such that
$C = \begin{bmatrix} 1 & 0 \\ -1 & 1\end{bmatrix} \quad \text{and} \quad m = \mathbf{0}.$
Also, setting
$A = \begin{bmatrix} 1 & 2 \\ 2 & 5\end{bmatrix},$
we get a strongly convex $\ell$. Then,
$g(\lambda; u) = A C u = \begin{bmatrix} 1 & 2 \\ 2 & 5\end{bmatrix}\begin{bmatrix} 1 & 0 \\ -1 & 1\end{bmatrix}\begin{bmatrix} u_1 \\ u_2\end{bmatrix} = \begin{bmatrix} -1 & 2 \\ -3 & 5\end{bmatrix}\begin{bmatrix} u_1 \\ u_2\end{bmatrix} = \begin{bmatrix} -u_1 + 2 u_2 \\ -3 u_1 + 5 u_2\end{bmatrix}.$
Finally, we have
$\mathbb{E}\,[g_1(\lambda; u)\, u_1] = \mathbb{E}\,[(-u_1 + 2 u_2)\, u_1] = -1 < 0,$
which violates Assumption 4.

Lemma 11. For any function $f \in C^1(\mathbb{R}, \mathbb{R}_{+})$, there is no constant $0 < L < \infty$ such that $|f(x) - f(y)| \geq L\,|x - y|$ for all $x, y \in \mathbb{R}$.

Proof. Suppose, for the sake of contradiction, that such an $L > 0$ exists. Letting $y \to x$ gives $|f'(x)| \geq L$ for all $x \in \mathbb{R}$. For each $x$, either $f'(x) \geq L$ or $f'(x) \leq -L$ holds. We discuss two cases based on the value of $f'(0)$. If $f'(0) \geq L$, we claim that $f'(x) \geq L$ for all $x \in \mathbb{R}$. Otherwise, $f'(x) < L$ for some $x$ implies $f'(x) \leq -L$. By the intermediate value theorem ($f'$ is continuous), there exists a point $y$ between $0$ and $x$ that attains the value $f'(y) = 0$, which is a contradiction. Now that $f'(x) \geq L > 0$ for all $x$, $f$ is an increasing function.
For any $x < 0$, we have
$f(x) = f(x) - f(0) + f(0) = -|f(x) - f(0)| + f(0) \leq -L\,|x| + f(0).$
Here, we can plug in $x' = -\frac{f(0)}{L}$:
$f(x') \leq -L\,\Big|\,{-\frac{f(0)}{L}}\,\Big| + f(0) = -|f(0)| + f(0) \leq 0,$
which implies that $f(x') \notin \mathbb{R}_{+}$, which is a contradiction. Now we discuss the second case, $f'(0) \leq -L$. By a similar argument, $f'(x) \leq -L$ for all $x \in \mathbb{R}$; thus, $f$ is a decreasing function. For any $x > 0$, we have
$f(x) = f(x) - f(0) + f(0) = -|f(x) - f(0)| + f(0) \leq -L\,x + f(0).$
Picking $x' = \frac{f(0)}{L}$ results in $f(x') \notin \mathbb{R}_{+}$, which is a contradiction.

Theorem 2. Let $\ell$ be $\mu$-strongly convex. Then, we have the following: (i) If $\phi$ is linear, the energy $f$ is $\mu$-strongly convex. (ii) If $\phi$ is convex, the energy $f$ is convex if and only if Assumption 4 holds. (iii) If $\phi$ is such that $\phi \in C^1(\mathbb{R}, \mathbb{R}_{+})$, the energy $f$ is not strongly convex.

Proof. The special case (i) is proven by Domke (2020, Theorem 9). We focus on the general statement (ii). If $\ell$ is $\mu$-strongly convex, the inequality
$\ell(z) - \ell(z') \geq \langle \nabla \ell(z'),\, z - z'\rangle + \frac{\mu}{2}\,\|z - z'\|_2^2 \quad (8)$
holds, where the general convex case is obtained as a special case with $\mu = 0$. The goal is to relate this to the ($\mu$-strong) convexity of the energy with respect to the variational parameters, given by
$f(\lambda) - f(\lambda') \geq \langle \nabla_{\lambda} f(\lambda'),\, \lambda - \lambda'\rangle + \frac{\mu}{2}\,\|\lambda - \lambda'\|_2^2.$

Proof of (ii). Plugging the reparameterized latent variables into Equation (8) and taking the expectation, we have
$\mathbb{E}\,\ell(\mathcal{T}_{\lambda}(u)) - \mathbb{E}\,\ell(\mathcal{T}_{\lambda'}(u)) \geq \mathbb{E}\,\langle \nabla \ell(\mathcal{T}_{\lambda'}(u)),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle + \frac{\mu}{2}\,\mathbb{E}\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2$
$\;\Longleftrightarrow\; f(\lambda) - f(\lambda') \geq \mathbb{E}\,\langle g(\lambda'; u),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle + \frac{\mu}{2}\,\mathbb{E}\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2.$
Thus, the energy is convex if and only if
$\mathbb{E}\,\langle g(\lambda; u),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle \leq \langle \nabla_{\lambda} f(\lambda),\, \lambda - \lambda'\rangle$
holds. This is established by Lemma 10.

Proof of (iii). We now prove that, under a nonlinear parameterization, the energy cannot be strongly convex. When the energy is convex, it is also $\mu$-strongly convex if and only if
$\frac{\mu}{2}\,\mathbb{E}\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2 \geq \frac{\mu}{2}\,\|\lambda - \lambda'\|_2^2.$
From the proof of Domke (2020, Lemma 5), it follows that
$\mathbb{E}\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2 = \|C - C'\|_{\mathrm{F}}^2 + \|m - m'\|_2^2.$
Furthermore, under nonlinear parameterizations,
$\|C - C'\|_{\mathrm{F}}^2 + \|m - m'\|_2^2 = \big\|\big(D_{\phi}(s) - D_{\phi}(s')\big) + (L - L')\big\|_{\mathrm{F}}^2 + \|m - m'\|_2^2;$
expanding the quadratic,
$= \|D_{\phi}(s) - D_{\phi}(s')\|_{\mathrm{F}}^2 + \|L - L'\|_{\mathrm{F}}^2 + 2\,\langle D_{\phi}(s) - D_{\phi}(s'),\, L - L'\rangle_{\mathrm{F}} + \|m - m'\|_2^2,$
and since $D_{\phi}(s) - D_{\phi}(s')$ and $L - L'$ reside in different subspaces, they are orthogonal. Thus,
$= \|D_{\phi}(s) - D_{\phi}(s')\|_{\mathrm{F}}^2 + \|L - L'\|_{\mathrm{F}}^2 + \|m - m'\|_2^2 = \|\phi(s) - \phi(s')\|_2^2 + \|L - L'\|_{\mathrm{F}}^2 + \|m - m'\|_2^2. \quad (9)$
For the energy to be strongly convex, Equation (9) must be bounded below by $\|\lambda - \lambda'\|_2^2$. Evidently, a necessary and sufficient condition for this is that
$|\phi(s_i) - \phi(s_i')| \geq L\,|s_i - s_i'|$
for some constant $0 < L < \infty$. Notice that the direction of the inequality is reversed relative to the Lipschitz condition. Unfortunately, there is no such continuous and differentiable function $\phi: \mathbb{R} \to \mathbb{R}_{+}$, as established by Lemma 11. Thus, for any diagonal conditioner $\phi \in C^1(\mathbb{R}, \mathbb{R}_{+})$, the energy cannot be strongly convex.
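As a quick numerical illustration of the counterexample in the proof of Proposition 1 (ii) above, the following sketch estimates $\mathbb{E}\,[g_1(\lambda; u)\, u_1]$ by Monte Carlo for $\ell(z) = \frac{1}{2} z^{\top} A z$ with the stated $A$ and $C$; the estimate concentrates around $-1 < 0$, violating Assumption 4. The snippet is purely illustrative.

```python
import numpy as np

# Counterexample from the proof of Proposition 1 (ii):
A = np.array([[1.0, 2.0],
              [2.0, 5.0]])      # strongly convex quadratic target
C = np.array([[1.0, 0.0],
              [-1.0, 1.0]])     # Cholesky-family scale matrix, m = 0

rng = np.random.default_rng(0)
u = rng.standard_normal((1_000_000, 2))      # base samples u ~ N(0, I)
g = u @ (A @ C).T                            # g(lambda; u) = A C u, row per sample
estimate = np.mean(g[:, 0] * u[:, 0])        # Monte Carlo estimate of E[g_1 u_1]
print(estimate)   # close to (A C)[0, 0] = -1 < 0, violating Assumption 4
```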
F.3 Convergence of Black-Box Variational Inference

F.3.1 Vanilla Black-Box Variational Inference

Theorem 5. Let the variational family satisfy Assumption 2, the likelihood satisfy Assumption 5, and the assumptions of Corollary 1 hold such that the ELBO, $F$, is $L_F$-smooth with $L_F = L_{\ell} + L_{\phi} + L_s$. Then, if the stepsize satisfies $\gamma < 1/L_F$, the iterates of BBVI with SGD and the $M$-sample reparameterization gradient estimator satisfy
$\min_{0 \leq t \leq T-1} \mathbb{E}\,\|\nabla F(\lambda_t)\|_2^2 \leq \gamma\,\frac{2 L_F L_{\ell}\,\kappa\, C(d, \varphi)}{M}\Big(\|z^*_{\mathrm{joint}} - z^*_{\mathrm{like}}\|_2^2 + 2\,(F^* - \ell^*_{\mathrm{like}})\Big) + \frac{2\Big(1 + \gamma^2 L_F\, \frac{4 L_{\ell}\,\kappa\, C(d, \varphi)}{M}\Big)^{T}}{\gamma\, T}\,\big(F(\lambda_0) - F^*\big),$
where $z^*_{\mathrm{joint}} = \mathrm{proj}_{\ell}(z)$ is the projection of $z$ onto the set of minimizers of $\ell$, $z^*_{\mathrm{like}} = \mathrm{proj}_{\ell_{\mathrm{like}}}(z)$ is the projection of $z$ onto the set of minimizers of $\ell_{\mathrm{like}}$, $\kappa = L_{\ell}/\mu$ is the condition number, $F^* = \inf_{\lambda \in \Lambda} F(\lambda)$, $\ell^*_{\mathrm{like}} = \inf_{z \in \mathbb{R}^d} \ell_{\mathrm{like}}(z)$, $C(d, \varphi) = d + k_{\varphi}$ for the nonlinear Cholesky family, $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the nonlinear mean-field family, and $M$ is the number of Monte Carlo samples.

Proof. Khaled & Richtárik (2023, Theorem 2) show that, if the objective function $F$ is $L_F$-smooth and the stochastic gradients satisfy the ABC condition
$\mathbb{E}\,\big\|\widehat{\nabla F}(\lambda)\big\|_2^2 \leq 2A\,(F(\lambda) - F^*) + B\,\|\nabla F(\lambda)\|_2^2 + C$
for some $0 < A, B, C < \infty$, SGD guarantees
$\min_{0 \leq t \leq T-1} \mathbb{E}\,\|\nabla F(\lambda_t)\|_2^2 \leq L_F\, C\, \gamma + \frac{2\,(1 + L_F \gamma^2 A)^{T}}{\gamma\, T}\,\big(F(\lambda_0) - F^*\big).$
Under the conditions of Corollary 1, $F$ is $L_F$-smooth with $L_F = L_{\ell} + L_s + L_{\phi}$. Furthermore, under Assumption 5, Kim et al. (2023) show that the Monte Carlo gradient estimates satisfy
$\mathbb{E}\,\big\|\widehat{\nabla F}(\lambda)\big\|_2^2 \leq \frac{4 L_{\ell}^2\, C(d, \varphi)}{\mu M}\,(F(\lambda) - F^*) + B\,\|\nabla F(\lambda)\|_2^2 + \frac{2 L_{\ell}^2\, C(d, \varphi)}{\mu M}\,\|z^*_{\mathrm{joint}} - z^*_{\mathrm{like}}\|_2^2 + \frac{4 L_{\ell}^2\, C(d, \varphi)}{\mu M}\,(F^* - \ell^*_{\mathrm{like}}).$
This means that the ABC condition is satisfied with constants
$A = \frac{4 L_{\ell}^2}{\mu M}\, C(d, \varphi), \qquad B = 1, \qquad C = \frac{2 L_{\ell}^2}{\mu M}\, C(d, \varphi)\,\|z^*_{\mathrm{joint}} - z^*_{\mathrm{like}}\|_2^2 + \frac{4 L_{\ell}^2}{\mu M}\, C(d, \varphi)\,(F^* - \ell^*_{\mathrm{like}}).$
Plugging these constants in, we obtain
$\min_{0 \leq t \leq T-1} \mathbb{E}\,\|\nabla F(\lambda_t)\|_2^2 \leq \gamma\,\frac{2 L_F L_{\ell}^2\, C(d, \varphi)}{\mu M}\Big(\|z^*_{\mathrm{joint}} - z^*_{\mathrm{like}}\|_2^2 + 2\,(F^* - \ell^*_{\mathrm{like}})\Big) + \frac{2\Big(1 + \gamma^2 L_F\,\frac{4 L_{\ell}^2}{\mu M}\, C(d, \varphi)\Big)^{T}}{\gamma\, T}\,\big(F(\lambda_0) - F^*\big).$
Substituting the condition number $\kappa = L_{\ell}/\mu$ yields the stated result.

Theorem 3. Let Assumption 2 hold, the likelihood satisfy Assumption 5, and the assumptions of Corollary 1 hold such that the ELBO $F$ is $L_F$-smooth with $L_F = L_{\ell} + L_{\phi} + L_s$. Then, the iterates generated by BBVI through Equation (1) with the $M$-sample reparameterization gradient estimator include an $\epsilon$-stationary point, $\min_{0 \leq t \leq T-1} \mathbb{E}\,\|\nabla F(\lambda_t)\|_2 \leq \epsilon$, for any $\epsilon > 0$ if
$T = \mathcal{O}\Big( \big(F(\lambda_0) - F^*\big)^2\, \frac{L_F\, L_{\ell}^2\, C(d, k_{\varphi})}{\mu\, M\, \epsilon^4}\Big)$
for some fixed stepsize $\gamma$, where $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family and $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family.

Proof. As a corollary to Theorem 5, Khaled & Richtárik (2023, Corollary 1) show that, for an $L_F$-smooth objective function $F$ and a gradient estimator satisfying the ABC condition, an $\epsilon$-stationary point can be encountered if
$\gamma \leq \min\Big(\frac{1}{\sqrt{L_F A T}},\; \frac{1}{L_F B},\; \frac{\epsilon^2}{2 L_F C}\Big), \qquad T \geq \frac{12\,\big(F(\lambda_0) - F^*\big)\, L_F}{\epsilon^2}\, \max\Big(B,\; \frac{12\,\big(F(\lambda_0) - F^*\big)\, A}{\epsilon^2},\; \frac{2C}{\epsilon^2}\Big).$
Under Assumption 5, Kim et al.
(2023) show that the Monte Carlo gradient estimates satisfy
$\mathbb{E}\,\big\|\widehat{\nabla F}(\lambda)\big\|_2^2 \leq \frac{4 L_{\ell}^2\, C(d, \varphi)}{\mu M}\,(F(\lambda) - F^*) + B\,\|\nabla F(\lambda)\|_2^2 + \frac{2 L_{\ell}^2\, C(d, \varphi)}{\mu M}\,\|z^*_{\mathrm{joint}} - z^*_{\mathrm{like}}\|_2^2 + \frac{4 L_{\ell}^2\, C(d, \varphi)}{\mu M}\,(F^* - \ell^*_{\mathrm{like}}).$
This means that the ABC condition is satisfied with constants
$A = \frac{4 L_{\ell}^2}{\mu M}\, C(d, \varphi), \qquad B = 1, \qquad C = \frac{2 L_{\ell}^2}{\mu M}\, C(d, \varphi)\,\Big(\|z^*_{\mathrm{joint}} - z^*_{\mathrm{like}}\|_2^2 + 2\,(F^* - \ell^*_{\mathrm{like}})\Big),$
where $z^*_{\mathrm{joint}} = \mathrm{proj}_{\ell}(z)$ is the projection of $z$ onto the set of minimizers of $\ell$, $z^*_{\mathrm{like}} = \mathrm{proj}_{\ell_{\mathrm{like}}}(z)$ is the projection of $z$ onto the set of minimizers of $\ell_{\mathrm{like}}$, $F^* = \inf_{\lambda \in \Lambda} F(\lambda)$, $\ell^*_{\mathrm{like}} = \inf_{z \in \mathbb{R}^d} \ell_{\mathrm{like}}(z)$, $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family, $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family, and $M$ is the number of Monte Carlo samples. Plugging these constants in, we obtain
$T \geq \frac{12\,\big(F(\lambda_0) - F^*\big)\, L_F}{\epsilon^2}\, \max\Big(1,\; \frac{48\,\big(F(\lambda_0) - F^*\big)\, L_{\ell}^2\, C(d, \varphi)}{\mu M \epsilon^2},\; \frac{8 L_{\ell}^2\, C(d, \varphi)}{\mu M \epsilon^2}\Big(\tfrac{1}{2}\|z^*_{\mathrm{joint}} - z^*_{\mathrm{like}}\|_2^2 + (F^* - \ell^*_{\mathrm{like}})\Big)\Big) = \mathcal{O}\Big(\big(F(\lambda_0) - F^*\big)^2\, \frac{L_F\, L_{\ell}^2\, C(d, \varphi)}{\mu M \epsilon^4}\Big),$
where we omitted the dependence on $k_{\varphi}$ and on the minimizers of $\ell$ and $\ell_{\mathrm{like}}$.

F.3.2 Proximal Black-Box Variational Inference

Lemma 3 (Convex Expected Smoothness). Let $\ell$ be $L_{\ell}$-smooth and $\mu$-strongly convex, with the variational family satisfying Assumption 2 with the linear parameterization. Then,
$\mathbb{E}\,\|\nabla_{\lambda} f(\lambda; u) - \nabla_{\lambda} f(\lambda'; u)\|_2^2 \leq 2 L_{\ell}\,\kappa\, C(d, \varphi)\; \mathrm{B}_f(\lambda, \lambda')$
holds, where $\mathrm{B}_f(\lambda, \lambda') \triangleq f(\lambda) - f(\lambda') - \langle \nabla f(\lambda'),\, \lambda - \lambda'\rangle$ is the Bregman divergence, $\kappa = L_{\ell}/\mu$ is the condition number, $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family, and $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family.

Proof. First, we have
$\mathbb{E}\,\|\nabla_{\lambda} f(\lambda; u) - \nabla_{\lambda} f(\lambda'; u)\|_2^2 = \mathbb{E}\,\|\nabla_{\lambda}\,\ell(\mathcal{T}_{\lambda}(u)) - \nabla_{\lambda}\,\ell(\mathcal{T}_{\lambda'}(u))\|_2^2 = \mathbb{E}\,\Big\|\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top} g(\lambda, u) - \frac{\partial \mathcal{T}_{\lambda'}(u)}{\partial \lambda}^{\top} g(\lambda', u)\Big\|_2^2.$
For the linear parameterization, the Jacobian of $\mathcal{T}_{\lambda}$ does not depend on $\lambda$. Therefore,
$= \mathbb{E}\,\Big\|\frac{\partial \mathcal{T}_{\lambda}(u)}{\partial \lambda}^{\top}\big(g(\lambda, u) - g(\lambda', u)\big)\Big\|_2^2,$
and Lemma 6 yields
$\leq \mathbb{E}\,J_{\mathcal{T}}(u)\,\|g(\lambda, u) - g(\lambda', u)\|_2^2,$
where $J_{\mathcal{T}}(u) = 1 + \|u\|_2^2$ for the Cholesky family and $J_{\mathcal{T}}(u) = 1 + \|U\|_{\mathrm{F}}^2$ for the mean-field family. From now on, we apply the strategy of Domke (2019, Theorem 3) for resolving the randomness of $u$. That is,
$\mathbb{E}\,J_{\mathcal{T}}(u)\,\|g(\lambda, u) - g(\lambda', u)\|_2^2 = \mathbb{E}\,J_{\mathcal{T}}(u)\,\|\nabla\ell(\mathcal{T}_{\lambda}(u)) - \nabla\ell(\mathcal{T}_{\lambda'}(u))\|_2^2;$
from the $L_{\ell}$-smoothness of $\ell$,
$\leq L_{\ell}^2\,\mathbb{E}\,J_{\mathcal{T}}(u)\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2,$
and applying Corollary 2,
$\leq L_{\ell}^2\, C(d, \varphi)\,\|\lambda - \lambda'\|_2^2.$
The last step follows the approach of Kim et al. (2023), where we convert the quadratic bound into a bound involving the energy. Recall that the $\mu$-strong convexity of $\ell$ implies
$\frac{\mu}{2}\,\|z - z'\|_2^2 \leq \ell(z) - \ell(z') - \langle \nabla \ell(z'),\, z - z'\rangle. \quad (10)$
From Lemma 8, we have
$L_{\ell}^2\, C(d, \varphi)\,\|\lambda - \lambda'\|_2^2 = L_{\ell}^2\, C(d, \varphi)\; \mathbb{E}\,\|\mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\|_2^2,$
and by $\mu$-strong convexity (Equation (10)),
$\leq \frac{2 L_{\ell}^2}{\mu}\, C(d, \varphi)\; \mathbb{E}\,\big(\ell(\mathcal{T}_{\lambda}(u)) - \ell(\mathcal{T}_{\lambda'}(u)) - \langle \nabla \ell(\mathcal{T}_{\lambda'}(u)),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle\big)$
$= \frac{2 L_{\ell}^2}{\mu}\, C(d, \varphi)\; \mathbb{E}\,\big(f(\lambda; u) - f(\lambda'; u) - \langle g(\lambda'; u),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle\big)$
$= \frac{2 L_{\ell}^2}{\mu}\, C(d, \varphi)\,\big(f(\lambda) - f(\lambda') - \mathbb{E}\,\langle g(\lambda'; u),\, \mathcal{T}_{\lambda}(u) - \mathcal{T}_{\lambda'}(u)\rangle\big).$
Finally, by applying the equality case of Lemma 10 (linear parameterization),
$= \frac{2 L_{\ell}^2}{\mu}\, C(d, \varphi)\,\big(f(\lambda) - f(\lambda') - \langle \nabla f(\lambda'),\, \lambda - \lambda'\rangle\big) = 2 L_{\ell}\,\kappa\, C(d, \varphi)\; \mathrm{B}_f(\lambda, \lambda').$

Lemma 12 (Variance Transfer). Let $\ell$ be $L_{\ell}$-smooth and $\mu$-strongly convex, with the variational family satisfying Assumption 2 with the linear parameterization. Also, let $\widehat{\nabla f}$ be an $M$-sample gradient estimator of the energy. Then,
$\mathrm{tr}\,\mathbb{V}\,\widehat{\nabla f}(\lambda) \leq \frac{4 L_{\ell}\,\kappa\, C(d, \varphi)}{M}\; \mathrm{B}_f(\lambda, \lambda') + 2\,\mathrm{tr}\,\mathbb{V}\,\widehat{\nabla f}(\lambda'),$
where $\kappa = L_{\ell}/\mu$ is the condition number, $\mathrm{B}_f$ is the Bregman divergence defined in Lemma 3, $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family, and $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family.

Proof. First, the $M$-sample gradient estimator is defined as
$\widehat{\nabla f}(\lambda) \triangleq \frac{1}{M}\sum_{m=1}^{M} \nabla_{\lambda} f(\lambda; u_m), \qquad u_m \sim \varphi.$
Since $u_1, \ldots, u_M$ are independent and identically distributed, we have
$\mathrm{tr}\,\mathbb{V}\,\widehat{\nabla f}(\lambda) = \frac{1}{M}\,\mathrm{tr}\,\mathbb{V}\,\nabla_{\lambda} f(\lambda; u).$
From here, given Lemma 3, the proof is identical to that of Garrigos & Gower (2023, Lemma 8.20), except for the constants.

Theorem 6. Let $\ell$ be $L_{\ell}$-smooth and $\mu$-strongly convex. Then, for BBVI with proximal SGD in Equation (2), $M$ Monte Carlo samples, a variational family satisfying Assumption 2 with the linear parameterization, and a fixed stepsize $0 < \gamma \leq \frac{M}{2 L_{\ell}\,\kappa\, C(d, \varphi)}$, the iterates satisfy
$\mathbb{E}\,\|\lambda_T - \lambda^*\|_2^2 \leq (1 - \gamma\mu)^{T}\,\|\lambda_0 - \lambda^*\|_2^2 + \frac{2\gamma\sigma^2}{\mu},$
where $\kappa = L_{\ell}/\mu$ is the condition number, $\sigma^2$ is defined in Lemma 4, $\lambda^* = \arg\min_{\lambda \in \Lambda} F(\lambda)$, $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family, and $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family.

Proof. Provided that (A.6.1) the energy $f$ is $\mu$-strongly convex, (A.6.2) the energy $f$ is $L_{\ell}$-smooth, (A.6.3) the regularizer $h$ is convex, (A.6.4) the regularizer $h$ is lower semi-continuous, (A.6.5) the convex expected smoothness condition holds, (A.6.6) the variance transfer condition holds, and (A.6.7) the gradient variance $\sigma^2$ at the optimum is finite, $\sigma^2 < \infty$, the proof is identical to that of Garrigos & Gower (2023, Theorem 11.9), which is based on the results of Gorbunov et al. (2020, Corollary A.2). In our setting, (A.6.1) is established by Theorem 2, (A.6.2) is established by Theorem 1, (A.6.3) is trivially satisfied since $h$ is the negative entropy, (A.6.4) is trivially satisfied since $h$ is continuous, (A.6.5) is established by Lemma 3, (A.6.6) is established by Lemma 12, and (A.6.7) is established by Lemma 4. The only difference is that we replace the constant $L_{\max}$ in the proof of Garrigos & Gower with $L_{\ell}\,\kappa\, C(d, \varphi)/M$. This stems from the different constants in the variance transfer condition.
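To make the scheme analyzed in Theorems 6 to 8 concrete, here is a minimal sketch of BBVI with proximal SGD for a mean-field family under the linear parameterization, using the closed-form entropy proximal step from Appendix D. The gradient oracle energy_grad, the fixed stepsize, and the initialization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def proximal_sgd_bbvi(energy_grad, d, T, gamma, M=8, seed=0):
    """BBVI with proximal SGD (fixed stepsize, mean-field, linear parameterization).

    energy_grad(m, s, u) must return the gradient of ell(s * u + m) with respect
    to (m, s) for a single base sample u; it is averaged over M draws per step.
    """
    rng = np.random.default_rng(seed)
    m, s = np.zeros(d), np.ones(d)
    for _ in range(T):
        grad_m, grad_s = np.zeros(d), np.zeros(d)
        for _ in range(M):                       # M-sample reparameterization gradient
            u = rng.standard_normal(d)
            gm, gs = energy_grad(m, s, u)
            grad_m += gm / M
            grad_s += gs / M
        m = m - gamma * grad_m                   # SGD step on the energy
        s_half = s - gamma * grad_s
        s = 0.5 * (s_half + np.sqrt(s_half**2 + 4 * gamma))  # prox of -log s
    return m, s

# Example oracle for ell(z) = 0.5 * z @ A @ z: the gradient with respect to m is
# A z and the gradient with respect to s is (A z) * u, with z = s * u + m.
```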
Theorem 7. Let $\ell$ be $L_{\ell}$-smooth and $\mu$-strongly convex. Then, for any $\epsilon > 0$, BBVI with proximal SGD in Equation (2), $M$ Monte Carlo samples, and a variational family satisfying Assumption 2 with the linear parameterization guarantees $\mathbb{E}\,\|\lambda_T - \lambda^*\|_2^2 \leq \epsilon$ if
$\gamma = \min\Big(\frac{\epsilon\,\mu}{2\sigma^2},\; \frac{M}{2 L_{\ell}\,\kappa\, C(d, \varphi)}\Big), \qquad T \geq \max\Big(\frac{4\sigma^2}{\epsilon\,\mu^2},\; \frac{2\kappa^2\, C(d, \varphi)}{M}\Big)\,\log\Big(\frac{2\,\|\lambda_0 - \lambda^*\|_2^2}{\epsilon}\Big),$
where $\kappa = L_{\ell}/\mu$, $\sigma^2$ is defined in Lemma 4, $\lambda^* = \arg\min_{\lambda \in \Lambda} F(\lambda)$, $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family, and $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family.

Proof. This is a corollary of the fixed-stepsize convergence guarantee in Theorem 6, as shown by Garrigos & Gower (2023, Corollary 11.10). They guarantee an $\epsilon$-accurate solution as long as
$\gamma = \min\Big(\frac{\epsilon\,\mu}{2\sigma^2_{\mathrm{F}}},\; \frac{1}{2 L_{\max}}\Big), \qquad T \geq \max\Big(\frac{4\sigma^2_{\mathrm{F}}}{\epsilon\,\mu^2},\; \frac{2 L_{\max}}{\mu}\Big)\,\log\Big(\frac{2\,\|\lambda_0 - \lambda^*\|_2^2}{\epsilon}\Big).$
In our notation, $\sigma^2_{\mathrm{F}} = \sigma^2$ and $L_{\max} = L_{\ell}\,\kappa\, C(d, \varphi)/M$.

Theorem 8. Let $\ell$ be $L_{\ell}$-smooth and $\mu$-strongly convex. Then, for BBVI with proximal SGD in Equation (2), the $M$-sample reparameterization gradient estimator, a variational family satisfying Assumption 2 with the linear parameterization, $T \geq 4 T_{\kappa}$, and the stepsize schedule
$\gamma_t = \begin{cases}\frac{M}{2 L_{\ell}\,\kappa\, C(d, \varphi)} & \text{for } t \leq 4 T_{\kappa}, \\ \frac{2t + 1}{(t + 1)^2\,\mu} & \text{for } t > 4 T_{\kappa},\end{cases}$
where $T_{\kappa} = \kappa^2\, C(d, \varphi)\, M^{-1}$, $\kappa = L_{\ell}/\mu$ is the condition number, $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family, and $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family, the iterates satisfy
$\mathbb{E}\,\|\lambda_T - \lambda^*\|_2^2 \leq \frac{16\, T_{\kappa}^2\,\|\lambda_0 - \lambda^*\|_2^2}{\mathrm{e}^2\, T^2} + \frac{8\sigma^2}{\mu^2\, T},$
where $\sigma^2$ is defined in Lemma 4, $\mathrm{e}$ is Euler's number, and $\lambda^* = \arg\min_{\lambda \in \Lambda} F(\lambda)$.

Proof. Under our assumptions, Theorem 6 holds, whose proof essentially amounts to obtaining the recursion
$\mathbb{E}\,\|\lambda_{t+1} - \lambda^*\|_2^2 \leq (1 - \gamma_t\mu)\,\mathbb{E}\,\|\lambda_t - \lambda^*\|_2^2 + 2\gamma_t^2\sigma^2.$
Instead of a fixed stepsize, we can apply the decreasing stepsize rule in the statement, after which the proof becomes identical to that of Gower et al. (2019, Theorem 3.2). We only need to replace $\mathcal{L}$ with $L_{\max}$ in the proof of Garrigos & Gower (2023, Theorem 11.9), which, in our notation, is $L_{\max} = L_{\ell}\,\kappa\, C(d, \varphi)/M$.

Theorem 4. Let $\ell$ be $L_{\ell}$-smooth and $\mu$-strongly convex. Then, for any $\epsilon > 0$, BBVI with proximal SGD in Equation (2), the $M$-sample reparameterization gradient estimator, and a variational family satisfying Assumption 2 with the linear parameterization guarantees $\mathbb{E}\,\|\lambda_T - \lambda^*\|_2^2 \leq \epsilon$ if
$\gamma_t = \begin{cases}\frac{M}{2 L_{\ell}\,\kappa\, C(d, \varphi)} & \text{for } t \leq 4 T_{\kappa}, \\ \frac{2t + 1}{(t + 1)^2\,\mu} & \text{for } t > 4 T_{\kappa},\end{cases} \qquad T \geq \max\Big(\frac{8\sigma^2}{\mu^2\,\epsilon} + \frac{4\, T_{\kappa}\,\|\lambda_0 - \lambda^*\|_2}{\mathrm{e}\,\sqrt{\epsilon}},\; 4 T_{\kappa}\Big),$
where $\sigma^2$ is defined in Lemma 4, $T_{\kappa} = \kappa^2\, C(d, \varphi)\, M^{-1}$, $\kappa = L_{\ell}/\mu$ is the condition number, $\mathrm{e}$ is Euler's number, $\lambda^* = \arg\min_{\lambda \in \Lambda} F(\lambda)$, $C(d, \varphi) = d + k_{\varphi}$ for the Cholesky family, and $C(d, \varphi) = 2 k_{\varphi}\sqrt{d} + 1$ for the mean-field family.

Proof. The computational complexity follows from the smallest number of iterations $T$ such that
$\mathbb{E}\,\|\lambda_T - \lambda^*\|_2^2 \leq \frac{16\, T_{\kappa}^2\,\|\lambda_0 - \lambda^*\|_2^2}{\mathrm{e}^2\, T^2} + \frac{8\sigma^2}{\mu^2\, T} \leq \epsilon.$
Multiplying both sides by $T^2$, as
$\epsilon\, T^2 - \frac{8\sigma^2}{\mu^2}\, T - \frac{16\, T_{\kappa}^2\,\|\lambda_0 - \lambda^*\|_2^2}{\mathrm{e}^2} \geq 0, \quad (11)$
we can see that we are looking for the smallest positive integer that is larger than the solution of a quadratic equation with respect to $T$.
This is given as
$T \geq \frac{\frac{8\sigma^2}{\mu^2} + \sqrt{\Big(\frac{8\sigma^2}{\mu^2}\Big)^2 + \frac{64\,\epsilon\, T_{\kappa}^2\,\|\lambda_0 - \lambda^*\|_2^2}{\mathrm{e}^2}}}{2\epsilon}.$
Applying the inequality $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$,
$\frac{\frac{8\sigma^2}{\mu^2} + \sqrt{\Big(\frac{8\sigma^2}{\mu^2}\Big)^2 + \frac{64\,\epsilon\, T_{\kappa}^2\,\|\lambda_0 - \lambda^*\|_2^2}{\mathrm{e}^2}}}{2\epsilon} \leq \frac{\frac{8\sigma^2}{\mu^2} + \frac{8\sigma^2}{\mu^2} + \frac{8\sqrt{\epsilon}\, T_{\kappa}\,\|\lambda_0 - \lambda^*\|_2}{\mathrm{e}}}{2\epsilon} = \frac{8\sigma^2}{\mu^2\,\epsilon} + \frac{4\, T_{\kappa}\,\|\lambda_0 - \lambda^*\|_2}{\mathrm{e}\,\sqrt{\epsilon}}.$
Thus, $\mathbb{E}\,\|\lambda_T - \lambda^*\|_2^2 \leq \epsilon$ can be satisfied with a number of iterations of at least
$T \geq \max\Big(\frac{8\sigma^2}{\mu^2\,\epsilon} + \frac{4\, T_{\kappa}\,\|\lambda_0 - \lambda^*\|_2}{\mathrm{e}\,\sqrt{\epsilon}},\; 4 T_{\kappa}\Big).$

G Details of Experimental Setup

Table 2: Summary of Datasets and Problems (abbreviation: model, dataset, dimensionality d, number of data points N)
LME-election: Linear Mixed Effects, 1988 U.S. presidential election (Gelman & Hill, 2007), d = 90, N = 11,566
LME-radon: Linear Mixed Effects, U.S. household radon levels (Gelman & Hill, 2007), d = 391, N = 12,573
BT-tennis: Bradley-Terry, ATP World Tour tennis, d = 6030, N = 172,199
LR-keggu: Linear Regression, KEGG-undirected (Shannon et al., 2003), d = 31, N = 63,608
LR-song: Linear Regression, million songs (Bertin-Mahieux et al., 2011), d = 94, N = 515,345
LR-buzz: Linear Regression, buzz in social media (Kawala et al., 2013), d = 81, N = 583,250
LR-electric: Linear Regression, household electric, d = 15, N = 2,049,280
AR-ecg: Sparse Autoregression, long-term ST ECG (Jager et al., 2003), d = 63, N = 20,642,000

Linear Regression (LR-*). We consider a basic Bayesian hierarchical linear regression model,
$\sigma_{\alpha} \sim \mathcal{N}_{+}(0, 10^2), \quad \sigma_{\beta} \sim \mathcal{N}_{+}(0, 10^2), \quad \sigma \sim \mathcal{N}_{+}(0, 0.3^2),$
$\beta \sim \mathcal{N}(0, \sigma_{\beta}^2 I), \quad \alpha \sim \mathcal{N}(0, \sigma_{\alpha}^2), \quad y_i \sim \mathcal{N}(\beta^{\top} x_i + \alpha,\, \sigma^2),$
where a weakly informative half-normal hyperprior $\mathcal{N}_{+}$, a normal distribution with support restricted to $\mathbb{R}_{+}$, is assigned to the hyperparameters. For the datasets, we consider large-scale regression problems obtained from the UCI repository (Dua & Graff, 2017), shown in Table 2. For all datasets, we standardize the regressors $x_i$ and the outcomes $y_i$.

Radon Levels (LME-radon). LME-radon is a radon-level regression problem by Gelman & Hill (2007). It fits a hierarchical mixed-effects model for estimating household radon levels across different counties while accounting for the floor elevation of each site. The model is described as
$\sigma \sim \mathcal{N}_{+}(0, 1^2), \quad \sigma_{\alpha} \sim \mathcal{N}_{+}(0, 1^2), \quad \mu_{\alpha} \sim \mathcal{N}(0, 10^2), \quad \eta \sim \mathcal{N}(0, 10^2 I),$
$\beta_1 \sim \mathcal{N}(0, 10^2), \quad \beta_2 \sim \mathcal{N}(0, 10^2),$
$\alpha = \mu_{\alpha} + \sigma_{\alpha}\,\eta, \qquad \mu_i = \alpha[\mathrm{county}_i] + \beta_1\,\log(\mathrm{uppm}_i) + \beta_2\,\mathrm{floor}_i, \qquad \log \mathrm{radon}_i \sim \mathcal{N}(\mu_i, \sigma^2),$
which uses varying slopes and intercepts with a non-centered parameterization. The dataset was obtained from posteriordb (Magnusson et al., 2022). While the Minnesota subset is often used for the radon regression problem due to computational reasons, here we use the full national dataset.

Presidential Election (LME-election). LME-election is a model for studying the effects of sociological factors on the 1988 United States presidential election (Gelman & Hill, 2007). The model is described as
$\sigma_{\mathrm{age}} \sim \mathcal{N}(0, 100^2), \quad \sigma_{\mathrm{edu}} \sim \mathcal{N}(0, 100^2), \quad \sigma_{\mathrm{age.edu}} \sim \mathcal{N}(0, 100^2), \quad \sigma_{\mathrm{state}} \sim \mathcal{N}(0, 100^2), \quad \sigma_{\mathrm{region}} \sim \mathcal{N}(0, 100^2),$
$b_{\mathrm{age}} \sim \mathcal{N}(0, \sigma_{\mathrm{age}}^2 I), \quad b_{\mathrm{edu}} \sim \mathcal{N}(0, \sigma_{\mathrm{edu}}^2 I), \quad b_{\mathrm{age.edu}} \sim \mathcal{N}(0, \sigma_{\mathrm{age.edu}}^2 I), \quad b_{\mathrm{state}} \sim \mathcal{N}(0, \sigma_{\mathrm{state}}^2 I), \quad b_{\mathrm{region}} \sim \mathcal{N}(0, \sigma_{\mathrm{region}}^2 I),$
$\beta \sim \mathcal{N}(0, 100^2 I),$
$p_i = \beta_1 + \beta_2\,\mathrm{black}_i + \beta_3\,\mathrm{female}_i + \beta_4\, v_{\mathrm{prev}, i} + \beta_5\,\mathrm{female}_i\,\mathrm{black}_i + b_{\mathrm{age}}[\mathrm{age}_i] + b_{\mathrm{edu}}[\mathrm{edu}_i] + b_{\mathrm{age.edu}}[\mathrm{age}_i\,\mathrm{edu}_i] + b_{\mathrm{state}}[\mathrm{state}_i] + b_{\mathrm{region}}[\mathrm{region}_i],$
$y_i \sim \mathrm{bernoulli}(p_i).$
The dataset was obtained from posteriordb (Magnusson et al., 2022).
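As an illustration of how such targets enter BBVI through the negative joint log-density $\ell(z) = -\log p(z, x)$, the following sketch writes out the unnormalized log joint of the hierarchical linear regression model above. Treating the scale parameters as given positive inputs (in practice they are optimized on an unconstrained scale through a bijective transform with the corresponding Jacobian term) is a simplifying assumption for illustration, not the experiment code.

```python
import numpy as np
from scipy.stats import norm

def lr_log_joint(alpha, beta, sigma_a, sigma_b, sigma, X, y):
    """Unnormalized log p(z, x) for the hierarchical linear regression model.

    sigma_a, sigma_b, sigma must be positive; ell(z) = -lr_log_joint(...) is the
    quantity whose expectation under q_lambda forms the energy.
    """
    lp = 0.0
    # Half-normal hyperpriors (normal log-density restricted to the positive half-line).
    lp += norm.logpdf(sigma_a, 0.0, 10.0) + np.log(2.0)
    lp += norm.logpdf(sigma_b, 0.0, 10.0) + np.log(2.0)
    lp += norm.logpdf(sigma, 0.0, 0.3) + np.log(2.0)
    # Priors on the intercept and the regression weights.
    lp += norm.logpdf(alpha, 0.0, sigma_a)
    lp += norm.logpdf(beta, 0.0, sigma_b).sum()
    # Gaussian likelihood of the standardized outcomes.
    lp += norm.logpdf(y, X @ beta + alpha, sigma).sum()
    return lp
```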
Bradley-Terry (BT-Tennis). BT-Tennis is a Bradley-Terry model for estimating the skill of professional tennis players, used by Giordano et al. (2023). The model is described as
$\sigma \sim \mathcal{N}_{+}(0, 1), \quad \theta \sim \mathcal{N}(\mathbf{0}, \sigma^2 I), \quad p_i = \theta[\mathrm{win}_i] - \theta[\mathrm{los}_i], \quad y_i \sim \mathrm{bernoulli}(p_i),$
where $\mathrm{win}_i$ and $\mathrm{los}_i$ are the indices of the winning and losing players of the $i$th game, respectively. While we subsample over the games $i = 1, \ldots, N$, each player's involvement is sparse in that each player plays only a handful of games. Consequently, the subsampling noise is substantial; we therefore use a larger batch size of 500. Similarly to Giordano et al. (2023), we use the ATP World Tour data publicly available online¹.

¹ https://datahub.io/sports-data/atp-world-tour-tennis-data

Autoregression (AR-ecg). AR-ecg is a linear autoregressive model. Here, we use a Student-t likelihood, as originally proposed by Christmas & Everson (2011). While they originally imposed an automatic relevance determination prior on the autoregressive coefficients, we instead use a horseshoe shrinkage prior (Carvalho et al., 2009, 2010). Since the horseshoe is known to result in complex posterior geometry, this should make the problem more challenging. The model is described as
$\alpha_d = 10^{-2}, \quad \beta_d = 10^{-2}, \quad \alpha_{\sigma} = 10^{-2}, \quad \beta_{\sigma} = 10^{-2},$
$d \sim \mathrm{gamma}(\alpha_d, \beta_d), \quad \sigma^{-1} \sim \mathrm{inverse\text{-}gamma}(\alpha_{\sigma}, \beta_{\sigma}), \quad \tau \sim \mathrm{cauchy}_{+}(0, 1), \quad \lambda \sim \mathrm{cauchy}_{+}(\mathbf{0}, \mathbf{1}), \quad \theta \sim \mathcal{N}(0, \tau\,\mathrm{diag}(\lambda)),$
$y[n] \sim \mathrm{student\text{-}t}\big(d,\; \theta_1\, y[n-1] + \theta_2\, y[n-2] + \cdots + \theta_P\, y[n-P],\; \sigma\big),$
where $d$ is the degrees of freedom of the Student-t likelihood and $\mathrm{cauchy}_{+}$ is a half-Cauchy prior. For the dataset, we use the long-term electrocardiogram measurements of Jager et al. (2003) obtained from PhysioNet (Goldberger et al., 2000). The data instance we use has a duration of 23 hours, sampled at 250 Hz with 12-bit resolution over a range of 10 millivolts. During the experiments, we observed that the hyperparameters suggested by Christmas & Everson are sensitive to the signal amplitude; we therefore scaled the signal amplitude to be 10.

H Additional Experimental Results

Figure 5: BBVI convergence speed (ELBO versus iteration, T = 1 to 50k) and robustness against the stepsize (ELBO at T = 50,000 versus base stepsize α, 10⁻⁴ to 0.1) on (a) LME-election, (b) LME-radon, (c) BT-tennis, and (d) LR-keggu. The error bands are the 80% quantiles estimated from 20 (10 for AR-ecg) independent replications. The initial point was m₀ = 0, C₀ = I.

Figure 6: BBVI convergence speed (ELBO versus iteration, T = 1 to 50k) and robustness against the stepsize (ELBO at T = 50,000 versus base stepsize α, 10⁻⁴ to 0.1) on (a) LR-song, (b) LR-buzz, (c) LR-electric, and (d) AR-ecg. The error bands are the 80% quantiles estimated from 20 (10 for AR-ecg) independent replications. The initial point was m₀ = 0, C₀ = I.