Globally Convergent Variational Inference

Declan McNamara, Jackson Loper, Jeffrey Regier
Department of Statistics, University of Michigan
{declan, jaloper, regier}@umich.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

In variational inference (VI), an approximation of the posterior distribution is selected from a family of distributions through numerical optimization. With the most common variational objective function, known as the evidence lower bound (ELBO), only convergence to a local optimum can be guaranteed. In this work, we instead establish the global convergence of a particular VI method. This VI method, which may be considered an instance of neural posterior estimation (NPE), minimizes an expectation of the inclusive (forward) KL divergence to fit a variational distribution that is parameterized by a neural network. Our convergence result relies on the neural tangent kernel (NTK) to characterize the gradient dynamics that arise from considering the variational objective in function space. In the asymptotic regime of a fixed, positive-definite neural tangent kernel, we establish conditions under which the variational objective admits a unique solution in a reproducing kernel Hilbert space (RKHS). Then, we show that the gradient descent dynamics in function space converge to this unique function. In ablation studies and practical problems, we demonstrate that our results explain the behavior of NPE in non-asymptotic finite-neuron settings, and show that NPE outperforms ELBO-based optimization, which often converges to shallow local optima.

1 Introduction

In variational inference (VI), the parameters η of an approximation to the posterior Q(Θ; η) are selected to optimize an objective function, typically the evidence lower bound (ELBO) (Blei et al., 2017). However, the ELBO is generally nonconvex in η, even for simple variational families such as the family of Gaussian distributions, and so only convergence to a local optimum of the ELBO can be guaranteed (Ghadimi and Lan, 2015; Ranganath et al., 2014; Hoffman et al., 2013). As the number of such optima and the degree of suboptimality of each are generally unknown, the lack of global convergence guarantees constitutes a significant complication for practitioners and a longstanding barrier to the broader adoption of VI.

In this work, we present the first global convergence result for variational inference. We accomplish this in the context of an increasingly popular alternative objective for variational inference, the expected forward KL divergence:

    L_P(ϕ) := E_{P(X)} KL[ P(Θ | X) ‖ Q(Θ; f(X; ϕ)) ].   (1)

Here, P(X) denotes a marginal of the model and P(Θ | X) denotes the posterior. For each x ∈ X, the approximation Q(Θ; η) to P(Θ | X = x) is indexed by the distributional parameters η ∈ Y ⊆ ℝ^q, which are themselves the output of a neural network f(x; ϕ) with weights ϕ ∈ Φ. Approximating a posterior distribution by minimizing L_P has a long history (Section 2.1); the approach is sometimes known as neural posterior estimation (NPE) (Papamakarios and Murray, 2016) and as the sleep objective of reweighted wake-sleep (Bornschein and Bengio, 2015; Le et al., 2019). Minimization of this objective is straightforward: computing unbiased gradients requires only sampling θ, x ∼ P(Θ, X) from the joint model (Section 2.1), which is readily accomplished by ancestral sampling.
This approach is likelihood-free in that the density of P(X | Θ) need not be evaluated, and therefore expected forward KL minimization is more widely applicable than ELBO-based optimization, which requires a tractable likelihood function. Analysis of the amortized problem (i.e., optimizing an objective that averages over P(X)) is beneficial when considering the forward KL; for the non-amortized problem in which a single observation x is considered, only biased estimates of the gradient of the forward KL can be obtained using self-normalized importance sampling, making convergence difficult to establish (Bornschein and Bengio, 2015; Le et al., 2019; Owen, 2013).

Our analysis considers a functional form of the variational objective L_P, given by

    L_F(f) := E_{P(X)} KL[ P(Θ | X) ‖ Q(Θ; f(X)) ],   (2)

where L_F : H → ℝ is defined over a general reproducing kernel Hilbert space of functions H. We refer to (1) as the parametric objective, as its argument is the parameters ϕ ∈ Φ, and we refer to (2) as the functional objective, as its argument is a function f ∈ H. These objectives are closely related: under a given network parameterization, provided f(·; ϕ) ∈ H, we have L_P(ϕ) = L_F(f(·; ϕ)). The objective L_P has been considered in several related works (see Section 2.1). The formulation of L_F and the analysis of its minimizer relative to those of L_P is our main contribution.

We first demonstrate strict convexity of the functional objective L_F when Q is parameterized as an exponential family distribution (Section 3). This implies the existence of a unique global optimizer f* of L_F for a large class of variational families. Afterward, we analyze kernel gradient flow dynamics using the neural tangent kernel to show that minimization of L_P results (asymptotically) in an empirical mapping f that is at most ϵ-suboptimal relative to f*, provided a sufficiently flexible neural network is used to parameterize f (Section 4). Together, these results imply that in the infinite-width limit, optimization of L_P by gradient descent recovers a unique global solution. Our analysis relies on fairly mild conditions, the most important of which are the positive definiteness of the neural tangent kernel and the structure of the variational family (e.g., an exponential family) (Section 6). Our proofs further assume a two-layer ReLU network architecture, but we conjecture that this assumption can be relaxed, and our experiments (Section 5) demonstrate global convergence for a wide variety of architectures.

We illustrate that the minimization of L_P converges to a global solution for problems with both synthetic and semi-synthetic data, and that finite network widths exhibit the behavior of the asymptotic regime (i.e., that of an infinitely wide network). We further show that optimizing L_P can produce better posterior approximations than likelihood-based ELBO methods, which suffer from convergence to shallow local optima. These results suggest, surprisingly, that a likelihood-free approach to inference can outperform likelihood-based approaches. Further, even for practitioners interested in inference for a single observation x₀, for whom amortization is not needed for computational efficiency, our approach may still be preferable to traditional ELBO-based inference due to the convergence guarantees of the former.

Related work. There is a large body of literature that analyzes the convergence of variational inference methods that target the ELBO.
These works typically prove rates of convergence to the posterior (Zhang and Gao, 2020) or to a local optimum of the objective, as the ELBO is not amenable to global minimization because it is nonconvex (Domke, 2020; Domke et al., 2023; Kim et al., 2023). Liu et al. (2023) demonstrated nonconvexity of the ELBO in the context of small-object detection, and showed empirically that the expected forward KL was more robust to the pitfalls of numerical optimization. The nonconvexity of the variational objective has previously been addressed through workarounds such as convex relaxations (Fazelnia and Paisley, 2018) or iterated runs of the optimization routine to improve the quality of the local optimum (Altosaar et al., 2018). Our work differs from previous work in that our convergence result is global. Additionally, our approach is novel compared to related analyses because we consider the (arguably) more complicated problem of amortized inference, where the variational parameters are the weights of a neural network and are shared among observations.

2 Background

2.1 The Expected Forward KL Divergence

The expected forward KL objective is equivalent to the sleep-phase objective of reweighted wake-sleep (RWS) (Bornschein and Bengio, 2015) and to the objective optimized by forward amortized variational inference (FAVI) (Ambrogioni et al., 2019). It is also a special case of the thermodynamic variational objective (TVO) (Masrani et al., 2019). Similar objectives have been referred to as neural posterior estimation (NPE) (Papamakarios and Murray, 2016; Papamakarios et al., 2019), though in these works the prior distribution, and thus the marginal P(X), mutates during training. Objectives based on the forward KL divergence generally result in variational posteriors that are overdispersed, a desirable property compared to reverse KL-based optimization (Le et al., 2019; Domke and Sheldon, 2018).

Unbiased gradient estimation for the parametric objective L_P is straightforward. The outer expectation over P(X) allows for gradients to be computed as

    ∇_ϕ E_{P(X)} KL[ P(Θ | X) ‖ Q(Θ; f(X; ϕ)) ] = −E_{P(Θ)} E_{P(X|Θ)} ∇_ϕ log q(Θ; f(X; ϕ)),

where q is the density of Q with respect to Lebesgue measure (see Appendix B for details). Rewriting the objective as an expectation over P(Θ) and P(X | Θ) by Bayes' rule illustrates that samples can be drawn from this model by ancestral sampling of Θ followed by X. Other methods targeting the forward KL, such as the wake phase of RWS, often optimize over a different expectation, typically E_{X∼D} KL[ P(Θ | X) ‖ Q(Θ; f(X; ϕ)) ]. Here, the outer expectation is over an empirical dataset D rather than P(X). In this case, approximation techniques such as importance sampling are required to estimate the gradient, as sampling from P(Θ | X = x) is intractable for any x. Relying on importance sampling results in biased gradient estimates, with which stochastic gradient descent (SGD) may not converge (Bornschein and Bengio, 2015; Le et al., 2019).
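To make this gradient estimator concrete, the following is a minimal PyTorch sketch of SGD on L_P via ancestral sampling. The generative model, the Gaussian variational family, and all hyperparameters here are illustrative stand-ins of our choosing, not the paper's experimental setup.

```python
import torch
from torch.distributions import Normal

# Illustrative generative model: theta ~ N(0, 1), x | theta ~ N(theta, 0.5^2).
def sample_joint(batch_size):
    theta = torch.randn(batch_size, 1)            # ancestral sampling: theta ~ P(Theta), ...
    x = theta + 0.5 * torch.randn(batch_size, 1)  # ... then x ~ P(X | Theta = theta)
    return theta, x

# Encoder f(x; phi): outputs the distributional parameters eta = (mean, log scale).
net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)

for step in range(1000):
    theta, x = sample_joint(128)
    eta = net(x)
    q = Normal(eta[:, :1], eta[:, 1:].exp())
    loss = -q.log_prob(theta).mean()  # unbiased estimate of L_P(phi), up to an additive constant
    opt.zero_grad(); loss.backward(); opt.step()
```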
2.2 The Neural Tangent Kernel

A neural network architecture and the parameter space Φ of its weights together define a family of functions {f(·; ϕ) : ϕ ∈ Φ}. Let ℓ(x, f(x)) denote a general real-valued loss function and consider selecting the parameters ϕ to minimize E_{P(X)} ℓ(X, f(X; ϕ)), where P(X) is a distribution on the data space X. The neural tangent kernel (NTK) (Jacot et al., 2018) analyzes the evolution of the function f(·; ϕ) while ϕ is fitted by gradient descent to minimize the above objective. Continuous-time dynamics are used in the formulation; ϕ(t) and f(·; ϕ(t)) are defined for continuous time t. The parameters ϕ thus follow the ODE

    ϕ̇(t) = −∇_ϕ E_{P(X)} ℓ(X, f(X; ϕ(t))).   (3)

Here, ϕ̇ denotes the derivative with respect to t, and by the chain rule, the function values f(x; ϕ(t)) evolve via

    ḟ(x; ϕ(t)) = −E_{P(X)} [ J_ϕ f(x; ϕ(t)) J_ϕ f(X; ϕ(t))^⊤ ] ℓ′(X, f(X; ϕ(t))),

where the bracketed product of Jacobians is the neural tangent kernel. We define ℓ′(X, f(X)) := ∇_f ℓ(X, f(X)) to simplify the notation. The NTK is thus

    K_ϕ(x, x̃) = J_ϕ f(x; ϕ) J_ϕ f(x̃; ϕ)^⊤.   (4)

The seminal work of Jacot et al. (2018) defined and studied this kernel and established the convergence of K_ϕ to a limiting kernel for certain neural network architectures as the layer width grows large.

2.3 Vector-Valued Reproducing Kernel Hilbert Spaces

Most existing NTK-based analyses consider neural networks with scalar outputs and squared error loss. Instead, we consider neural networks with multivariate outputs to accommodate the multidimensional distributional parameter η, which parameterizes our variational distribution Q(Θ; η). Furthermore, we consider the objective functions L_P and L_F. Consequently, we rely on results from the vector-valued reproducing kernel Hilbert space (RKHS) literature, as these spaces contain vector-valued functions such as the network function f(·; ϕ). Carmeli et al. (2008, 2006) provide a detailed review, and in the following, we summarize the key properties.

Recall that η = f(·; ϕ) and η ∈ ℝ^q. An ℝ^q-valued kernel on X is a map K : X × X → ℝ^{q×q}. The neural tangent kernel (4) is precisely such a kernel. If the ℝ^q-valued kernel K is positive definite, then K defines a unique Hilbert space of functions H whose elements are maps from X to ℝ^q (Carmeli et al., 2006). This kernel K is called the reproducing kernel, and the corresponding space of functions is the RKHS associated with the kernel K. In an ℝ^q-valued RKHS, the reproducing property takes on a more general form. For any x ∈ X, f ∈ H, and v ∈ ℝ^q,

    f(x)^⊤ v = ⟨f(x), v⟩_{ℝ^q} = ⟨f(·), K(·, x)v⟩_H,

where ⟨·,·⟩_H is the inner product of the Hilbert space H. For any fixed x ∈ X and v ∈ ℝ^q, K(·, x)v is a function mapping from X to ℝ^q, as is required for it to be an element of H.

3 Convexity of the Functional Objective

We now turn to the analysis of the functional objective L_F given in Equation (2). We fix an RKHS H over which to minimize L_F for now, specializing to the particular choice of H based on the neural tangent kernel subsequently. Let ℓ(x, f(x)) = KL[ P(Θ | X = x) ‖ Q(Θ; f(x)) ]. The functional L_F then has the form L_F = E_{P(X)} ℓ(X, f(X)); we will use this form subsequently for our neural tangent kernel analysis. Our first result shows that targeting L_F is highly desirable theoretically: L_F admits a unique global minimizer if the variational family Q is an exponential family, as is common practice in VI.

Lemma 1. Suppose that Q(Θ; η) is an exponential family distribution in minimal representation with natural parameters η, sufficient statistics T(θ), and density q(θ; η) with respect to Lebesgue measure λ(Θ). Then, for any observation x ∈ X ⊆ ℝ^d, the loss function ℓ(x, η) = KL[ P(Θ | X = x) ‖ Q(Θ; η) ] is strictly convex in η, provided that P(Θ | X = x) ≪ Q(Θ; η) ≪ λ(Θ) for all η ∈ Y ⊆ ℝ^q.

A proof of Lemma 1, which follows quickly from the convexity of the log partition function in the natural parameter (Wainwright and Jordan, 2008, Proposition 3.1), is provided in Appendix A. Lemma 1 shows the strict convexity of the function ℓ in η. This implies the strict convexity of the functional L_F(f) = E_{P(X)} ℓ(X, f(X)) in f by the linearity of expectation, which in turn implies the existence of at most one global minimizer.
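As a quick numerical illustration of Lemma 1 (ours, not from the paper), the following sketch considers a univariate Gaussian family in its natural parameterization, q(θ; η) ∝ exp(η₁θ + η₂θ²) with η₂ < 0 and T(θ) = (θ, θ²). Up to a constant, ℓ(x, η) = A(η) − η^⊤ E_{p*}[T(θ)], so its Hessian in η equals the Hessian of the log partition function A, which autograd confirms is positive definite.

```python
import torch

# Log partition of the univariate Gaussian family in natural parameters (eta2 < 0):
# A(eta) = -eta1^2 / (4*eta2) + 0.5 * log(pi / -eta2).
def log_partition(eta):
    eta1, eta2 = eta[0], eta[1]
    return -eta1**2 / (4 * eta2) + 0.5 * torch.log(torch.pi / -eta2)

theta = 1.3 + 0.4 * torch.randn(10_000)              # samples from a stand-in "posterior" p*
ET = torch.stack([theta.mean(), (theta**2).mean()])  # E_{p*}[T(theta)] for T = (theta, theta^2)

# ell(eta) up to an additive constant; the linear term does not affect the Hessian.
def ell(eta):
    return log_partition(eta) - eta @ ET

eta0 = torch.tensor([1.0, -0.5])
H = torch.autograd.functional.hessian(ell, eta0)
print(torch.linalg.eigvalsh(H))  # strictly positive eigenvalues: ell is strictly convex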
Corollary 1. Suppose that Q(Θ; η) is an exponential family distribution. Then, under the conditions of Lemma 1, the functional objective L_F(f) := E_{P(X)} KL[ P(Θ | X) ‖ Q(Θ; f(X)) ] is strictly convex in f. Consequently, the set of global minimizers of L_F is either a singleton set or empty.

We will assume that the set of global minimizers is nonempty (so that the minimization of L_F is well-posed) and let f* denote the global minimizer. We also assume that ‖f*‖_H < ∞ so that f* ∈ H. Hereafter, we use the term unique to mean unique almost everywhere with respect to P(X). Furthermore, in a slight abuse of notation, f* will denote the unique equivalence class of functions that minimize L_F(f). Whereas Lemma 1 establishes the convexity of the (non-amortized) forward KL divergence, Corollary 1 establishes the convexity of L_F, an amortized objective, in function space. The convexity of the functional objective L_F holds regardless of the distribution chosen for the outer expectation by the same linearity argument. Choices other than P(X), however, may not permit unbiased gradient estimation, as is the case for the wake-phase updates of RWS (Section 2.1).

4 Global Optima of the Parametric Objective

In practice, we must directly minimize L_P rather than L_F, as optimizing the latter over the infinite-dimensional space H directly is not tractable. Thus, in the second phase of our analysis, we consider convergence to f* by minimizing the parametric objective L_P with gradient descent. We define ϕ across continuous time as in Equation (3). Continuous-time dynamics simplify theoretical analysis; SGD with unbiased gradients follows a (noisy) Euler discretization of the continuous ODE (Santambrogio, 2017; Yang et al., 2021). Considering X ∼ P(X) for the outer expectation in both L_P and L_F is key in this context: this choice enables unbiased stochastic gradient estimation for L_P (see Appendix B), whereas other choices require approximations that result in biased gradient estimates (see Section 2.1) and thus follow different gradient dynamics.

Analysis of the trajectories of the parametric objective L_P throughout its minimization initially seems infeasible: the argument of this objective is the neural network parameters ϕ, and even well-behaved loss functions such as the mean squared error (MSE) are nonconvex in these parameters. Nevertheless, neural tangent kernel (NTK)-based results enable analysis of L_P. We bridge the divide between the minimizers of the convex functional L_F and the nonconvex objective L_P using the limiting kernel, and show that in the large-width limit, the optimization path of L_P converges arbitrarily close to f*, the unique minimizer of the functional objective L_F.

Theorem 1. Consider the width-p scaled two-layer ReLU network, evolving via the flow

    ḟ_t(x) = −E_{P(X)} K^p_{ϕ(t)}(x, X) ℓ′(X, f_t(X)),   (5)

where f_t denotes f(·; ϕ(t)). Let f* denote the unique minimizer of L = L_F from Lemma 1, and fix ϵ > 0. Then, under conditions (C1)–(C4), (D1)–(D4), and (E1)–(E5), there exists T > 0 such that almost surely

    lim_{p→∞} L(f_T) ≤ L(f*) + ϵ.   (6)

Regularity conditions (C1)–(C4), (D1)–(D4), and (E1)–(E5) are provided in Appendices C, D, and E, respectively.
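For concreteness, the following is a minimal sketch of the width-p scaled two-layer ReLU network that Theorem 1 concerns, under our reading of the scaling and initialization detailed in Appendix C (the 1/√p output scaling, Gaussian first-layer weights, and zero second-layer weights); the dimensions are illustrative.

```python
import torch

class ScaledTwoLayerReLU(torch.nn.Module):
    """f_i(x) = p^{-1/2} * sum_j a_ij * relu(w_j^T x), mapping R^d -> R^q."""
    def __init__(self, d, q, p):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(p, d))   # first layer: w_j ~ N(0, I_d)
        self.a = torch.nn.Parameter(torch.zeros(q, p))   # second layer initialized to zero
        self.p = p

    def forward(self, x):
        h = torch.relu(x @ self.w.T)                     # (batch, p)
        return h @ self.a.T / self.p**0.5                # (batch, q), 1/sqrt(p) scaling

net = ScaledTwoLayerReLU(d=2, q=2, p=1024)
eta = net(torch.randn(8, 2))   # distributional parameters eta = f(x; phi); zero at initialization
```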
We consider a scaled two-layer ReLU network architecture (further detailed in Appendix C) and use this simple architecture to prove the results as the network width p tends to infinity. Our results may also be extended to multilayer perceptrons with other activation functions. Below, we briefly sketch the key ingredients needed to prove Theorem 1.

Recall the NTK K^p_ϕ from Equation (4), where we now let p denote the network width. For certain neural network architectures, Jacot et al. (2018) show that as the network width p tends to infinity, the neural tangent kernel becomes stable and tends (pointwise) towards a fixed, positive-definite limiting neural tangent kernel K^∞. Under suitable positivity conditions on the limiting kernel, we take the domain H of L_F to be the RKHS with kernel K^∞ (Section 2.3). Because L_F has a unique minimizer f*, under mild conditions on K^∞, f* may be characterized as the solution obtained by following kernel gradient flow dynamics in H, that is, the ODE given by

    ḟ_t(x) = −E_{P(X)} K^∞(x, X) ℓ′(X, f_t(X)).

In other words, starting from some function f₀, following the limiting NTK gradient flow dynamics above minimizes the functional objective L_F for sufficiently large T.

Lemma 2. Let f* denote the minimizer of L_F from Lemma 1, and fix ϵ > 0. Fix f₀, and let K^∞ denote the limiting neural tangent kernel. Let f₀ evolve according to the dynamics ḟ_t(x) = −E_{P(X)} K^∞(x, X) ℓ′(X, f_t(X)). Suppose that the conditions of Lemma 1 and (E1)–(E3) hold. Then, there exists T > 0 such that L(f_T) ≤ L(f*) + ϵ, where L is the loss functional L_F.

Appendix E enumerates regularity conditions (E1)–(E3) and provides a proof of Lemma 2. The characterization of f* in Lemma 2 clarifies how the analysis of the parametric objective L_P will proceed. The gradient flow of L_P causes the network function to similarly evolve according to a kernel gradient flow via the empirical neural tangent kernel, that is,

    ḟ(x; ϕ(t)) = −E_{P(X)} K^p_{ϕ(t)}(x, X) ℓ′(X, f(X; ϕ(t))),

as derived in Section 2.2. Comparison of the minimizers of L_P and L_F can be accomplished by comparing the two gradient flows above, i.e., kernel gradient flow dynamics that follow K^p_{ϕ(t)} and K^∞, respectively. As K^p_{ϕ(t)} → K^∞ (Appendix D), these trajectories should not differ greatly: for any fixed T, the functions obtained by following the kernel gradient dynamics with K^p_{ϕ(t)} and K^∞ can be made arbitrarily close to one another, provided p is sufficiently large. The proof of Theorem 1 first selects a T using Lemma 2, and then bounds the difference in the trajectories on [0, T] for sufficiently large width p by convergence of the kernels K^p_{ϕ(t)} → K^∞. Our proof differs from previous results in that it relies on uniform convergence of kernels (cf. Appendices C and D), enabling the analysis of population quantities such as E_{P(X)} ℓ(X, f(X)).

Theorem 1 proves convergence to an ϵ-neighborhood of the global solution when optimizing L_P despite the highly nonconvex nature of this optimization problem in the network parameters ϕ. For sufficiently flexible network architectures, optimization of L_P thus behaves similarly to that of L_F, which we have shown is a convex problem in the function space H in Section 3.

5 Experiments

Having established conditions under which global convergence is guaranteed, the main aim of our experiments is to demonstrate approximate global convergence in practice, even for scenarios where the conditions assumed in our proofs are not satisfied exactly.
Section 5.1 demonstrates that finite-neuron layer widths used in practice approximate the limiting behavior well, while Section 5.2 and Section 5.3 utilize problem-specific network architectures for amortized inference. Our results suggest that there may exist weaker assumptions under which global convergence is still guaranteed.

5.1 Toy Example

We first assess whether the asymptotic regime of Theorem 1 is relevant to practice with layers of finite width. We use a diagnostic motivated by the lazy training framework of Chizat et al. (2019), which provides the intuition that in the limiting NTK regime, the function f behaves much like its linearization around the initial weights ϕ₀:

    f(x; ϕ) ≈ f(x; ϕ₀) + J_ϕ f(x; ϕ₀)(ϕ − ϕ₀).   (7)

Liu et al. (2020) prove that equality holds exactly in the equation above if and only if f(x; ϕ) has a constant tangent kernel. Therefore, similarity between f and its linearization indicates that the asymptotic regime of the limiting NTK is approximately achieved. Note that even if f is linear in ϕ, as in the above expression, it may still be highly nonlinear in x.

We consider a toy example for which ‖x‖₂ = 1. The generative model first draws a rotation angle Θ uniformly between 0 and 2π, and then a rotation perturbation Z ∼ N(0, σ²), where we take σ = 0.5. Conditional on Θ and Z, the data x is deterministic: x = [cos(θ + z), sin(θ + z)]^⊤. This construction ensures that the data lie on the sphere S¹ ⊂ ℝ², which guarantees the positivity of the limiting NTK for certain architectures (Jacot et al., 2018). We aim to infer Θ given a realization x, marginalizing over the nuisance latent variable Z. Our variational family Q(Θ; f(x)) is a von Mises distribution, whose support is the interval [0, 2π]. This family is an exponential family distribution, allowing for the application of Lemma 1. The encoder network f(·; ϕ) is given by a dense two-layer network (that is, one hidden layer) with rectified linear unit (ReLU) activation, which we study as the network width p grows. The network outputs f(x; ϕ) parameterize the natural parameter η.

We fit the neural network f(x; ϕ) in two ways. First, we use SGD to fit the network parameters ϕ to minimize L_P. Second, we fit the linearization f_lin(x; ϕ) = f(x; ϕ₀) + J_ϕ f(x; ϕ₀)(ϕ − ϕ₀) in ϕ. We perform both of these fitting procedures for various widths p. For both settings, stochastic gradient estimation was performed by following the procedure in Appendix B. For evaluation, we fix N = 1000 independent realizations x*₁, …, x*_N from the generative model with underlying ground-truth latent parameter values θ*₁, …, θ*_N, and evaluate the held-out negative log-likelihood (NLL), −(1/N) Σ_{i=1}^N log q(θ*_i | f(x*_i; ϕ)), for both functions: f(x; ϕ) and f_lin(x; ϕ). Figure 1 shows the evolution of the held-out NLL across the fitting procedure for three different network widths p: 64, 256, and 1024. The difference in quality between the linearizations and the true functions at convergence diminishes as the width p grows; for p = 1024, the two are nearly identical, providing evidence that the asymptotic regime is achieved.

Figure 1: Negative log-likelihood across gradient steps, for network widths 64, 256, and 1024 neurons. NLL for the exact posterior is denoted by the red line.
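A sketch of this toy generative model and of the linearization in Eq. (7) follows; the helper names are ours, and the jvp-based linearization is one way (under our assumptions, not necessarily the paper's implementation) to train f_lin.

```python
import torch

def sample_toy(batch_size, sigma=0.5):
    """Toy model of Section 5.1: theta ~ U(0, 2*pi), z ~ N(0, sigma^2), x on the unit circle."""
    theta = 2 * torch.pi * torch.rand(batch_size)
    z = sigma * torch.randn(batch_size)
    x = torch.stack([torch.cos(theta + z), torch.sin(theta + z)], dim=1)  # ||x||_2 = 1
    return theta, x

def f_lin(net, phi0, phi, x):
    """Linearization f(x; phi0) + J_phi f(x; phi0)(phi - phi0), computed as a single jvp."""
    delta = {k: phi[k] - phi0[k] for k in phi0}
    out0, tangent = torch.func.jvp(
        lambda p: torch.func.functional_call(net, p, (x,)), (phi0,), (delta,))
    return out0 + tangent

# Usage sketch: phi0 = {k: v.detach().clone() for k, v in net.named_parameters()};
# optimize a copy phi of these tensors through f_lin instead of calling net directly.
```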
5.2 Label Switching in Amortized Clustering

In this experiment and the subsequent one, we consider a more diverse set of network architectures. Although Theorem 1 assumes a shallow, dense network architecture, these experimental results show that empirically, global convergence can be obtained for many other architectures as well.

We consider the difficult problem of amortizing clustering. In this problem, we are given an unordered collection (set) of data x = {x₁, …, x_n}, x_i ∈ ℝ, and our objective is to infer the locations and labels of the cluster centers from which the data were generated. The generative model first draws a shift parameter S ∼ N(0, 100²), followed by

    Z | (S = s) ∼ N( [µ₁ + s, …, µ_d + s]^⊤, σ²I_d ),
    X_i | (Z = z) ∼iid Σ_{j=1}^d p_j N(z_j, τ²).

In the above, σ², τ² ∈ ℝ, µ ∈ ℝ^d, and p ∈ ℝ^d are known variances, locations, and proportions, respectively. The global parameter S = s ∈ ℝ shifts the d locations µ₁, …, µ_d to µ₁ + s, …, µ_d + s. The cluster centers Z are then obtained by adding noise to these locations. An implicit labeling is imposed on the centers by assigning each a dimension in ℝ^d via the prior. Finally, the data are drawn independently from a mixture of d univariate Gaussians with centers Z_j, j = 1, …, d.

We consider the tasks of inferring the scalar shift S and the vector of cluster centers Z. We fix µ = [−20, −10, 0, 10, 20]^⊤ with d = 5. We fix the hyperparameters σ = 0.5 and τ = 0.1, and artificially fix the shift as S = 100 to generate n = 1000 independent realizations X = x from the generative model. Inferring S should be straightforward because the vector µ is known and fixed, ensuring that the joint likelihood p(x, s) (marginalizing over z) is unimodal in s. However, inference on Z may be difficult for likelihood-based methods because the order of the entries of Z matters: the joint density p(x, z, s) ≠ p(x, π(z), s) for a permutation π, even though p(x | z) = p(x | π(z)). Label switching can thus pose a significant obstacle for likelihood-based methods, as any permutation of the cluster centers Z will still explain the data well. This problem formulation thus results in a likelihood, and hence a posterior density, with many local optima but a single global optimum (i.e., where the cluster centers have the correct labeling).

Now we show that likelihood-based approaches to variational inference, such as maximizing the evidence lower bound (ELBO), result in suboptimal solutions compared to the minimization of L_P. We take the variational distributions q(S; f₁(x; ϕ₁)) and q(Z; f₂(x; ϕ₂)) to be Gaussian. We fit the networks f₁ and f₂. Due to the exchangeability of the observations x = {x₁, …, x_n}, we parameterize each as a permutation-invariant neural network, fitting both ϕ₁ and ϕ₂ to either minimize L_P or to maximize the ELBO (see Appendix F). We perform 100 replications of this experiment across different random seeds, and consider two different parameterizations of the Gaussian variational distribution: a mean-only parameterization with fixed unit variance and a natural parameterization with an unknown mean and unknown variance. Both of these variational families are exponential families, and so convexity (in the sense of Corollary 1) holds.
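A sketch of an ancestral sampler for this model follows. We assume uniform mixture proportions p_j = 1/d here for illustration, since the paper states only that the proportions are known.

```python
import torch

def sample_clustering(n=1000, mu=torch.tensor([-20., -10., 0., 10., 20.]),
                      sigma=0.5, tau=0.1, shift_scale=100.0):
    """Ancestral sampler for the amortized clustering model of Section 5.2 (a sketch)."""
    d = mu.numel()
    s = shift_scale * torch.randn(())     # shift S ~ N(0, 100^2)
    z = mu + s + sigma * torch.randn(d)   # cluster centers Z | S = s
    comp = torch.randint(d, (n,))         # mixture labels; p_j = 1/d assumed here
    x = z[comp] + tau * torch.randn(n)    # X_i ~ sum_j p_j N(z_j, tau^2)
    return s, z, x
```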
Figure 2 plots kernel-smoothed frequencies of point estimates of S, where each point estimate is the mode of the variational posterior for an experimental replicate. Both ELBO- and L_P-based training estimate S = 100 well. However, the limitations of ELBO-based training are evident in the fidelity of the posterior approximation of Z. Table 2 reports the average ℓ₁ distance ‖Ẑ − Z‖₁ between the variational mode Ẑ and the true latent draw Z. Optimization of the ELBO converges to a local optimum that is a permutation of the entries of Z, resulting in a large ℓ₁ distance on average. Minimization of L_P, on the other hand, converges to the global optimum without any label switching. Table 1 indicates the degree of label switching, showing the proportion of trials in which the entries of Ẑ were correctly ordered.

             ELBO    L_P
    Mean     0.03    1.00
    Natural  0.02    1.00

Table 1: Proportion out of one hundred replicates where the posterior mode of q(Z; f₂(x; ϕ₂)) was a vector in increasing order.

             ELBO          L_P
    Mean     77.0 (45.2)   1.8 (2.3)
    Natural  86.0 (26.9)   2.9 (0.8)

Table 2: Average ℓ₁ distance ‖Ẑ − Z‖₁ (and standard deviations) across one hundred replicates.

Figure 2: Mode of q(S; f₁(x; ϕ₁)) across experimental replicates.

5.3 Rotated MNIST Digits

We consider the task of inferring a shared rotation angle Θ for a set of N MNIST digits. The generative model is as follows. The rotation angle is drawn as Θ ∼ Uniform(0, 2π), and for all i ∈ [N] a noise term is drawn as Z_i ∼ N(0_p, σ²I_p). Finally, each image is drawn as

    X_i | (Z_i = z_i, Θ = θ) ∼ N( RotateMNIST(z_i, θ), τ² ),   i ∈ [N].   (8)

Here, RotateMNIST is fitted ahead of time and fixed throughout this experiment. Given a latent representation z and angle θ, RotateMNIST returns a 28 × 28 MNIST digit image rotated counterclockwise by θ degrees. (See Appendix F for additional experimental details.) We aim to fit the variational posterior q(Θ; f(x₁, …, x_N; ϕ)), implicitly marginalizing over the nuisance latent variables Z₁, …, Z_N. The true data {x_i}_{i=1}^N are generated from the above model, with N = 1000 digits and an underlying counterclockwise rotation angle of θ = 260 degrees. We provide visualizations of some of these digits in Figure 3. The variational distribution q(Θ; f(x₁, …, x_N; ϕ)) is taken to be a von Mises distribution with natural parameterization, as in Section 5.1, although for this example the architecture of the encoder network makes f invariant to permutations of the inputs {x_i}_{i=1}^N, to reflect the exchangeability of the data.

We fit f to minimize L_P using 100,000 iterations of SGD. We compare to fitting θ to directly maximize the likelihood of the data p(θ, {x_i}_{i=1}^N). As this quantity is intractable, we maximize the importance-weighted bound (IWBO), which is a generalization of the ELBO (Burda et al., 2016), by maximizing the joint likelihood p(θ, {z_i}_{i=1}^N, {x_i}_{i=1}^N) in θ while fitting a Gaussian variational distribution on the variables Z. The likelihood function is multimodal in θ due to the approximate rotational symmetry of several handwritten digits: zero, one, and eight are approximately invariant under rotations of 180 degrees. Additionally, the digits six and nine are similar following 180-degree rotations. These symmetries yield a multimodal posterior distribution on Θ. Likelihood-based fitting procedures, such as maximizing the IWBO, often get stuck in these shallow local optima, while fitting f to minimize the parametric objective L_P finds a unique global solution. Figure 4 shows estimates of the angle θ conditional on the data {x_i}_{i=1}^N during training with the IWBO objective, with θ initialized to a variety of values. For some initializations, the IWBO optimization converges quickly to near the correct value of 260 degrees, but in many others, it converges to a shallow local optimum.

Figure 3: 100 of the N = 1000 data observations with counterclockwise rotation of 260 degrees.

Figure 4: Estimate of angle θ across gradient steps, with fitting performed to maximize the IWBO.
We perform the same routine, fitting q(Θ; f(x₁, …, x_N; ϕ)) to minimize the expected forward KL divergence L_P. Figure 5 shows that across a variety of initializations of the angles, this approach always converges to a unique solution and fits the posterior mode to the correct value of 260 degrees. L_P minimization converges rapidly, and so Figure 6 zooms in on the initial few thousand gradient steps to show the various trajectories among initializations.

Figure 5: Mode of q(Θ; f(x₁, …, x_N; ϕ)) across training (starting at different initializations) when minimizing objective L_P.

Figure 6: Zoomed-in trajectories across the first 2000 gradient steps, showing similar estimates regardless of initialization.

5.4 Local Optima vs. Global Optima

In this experiment, we directly compare the quality of the variational approximation minimizing the expected forward KL objective to that of the local optima found by optimizing the ELBO. We consider an adapted version of the rotated MNIST digit problem outlined above. Each digit x_i is drawn from the model x_i ∼ N( Rotate(x⁰_i, θ), τ² ) for i = 1, …, 50 with the angle set to θ_true = 260 degrees. The unrotated digits x⁰_i are fixed a priori, eliminating the nuisance latent variables Z in this setting. This allows us to directly perform ELBO-based variational inference on Θ, instead of merely maximizing the likelihood (or a bound thereof) as in Section 5.3.

We fit an amortized Gaussian variational distribution with fixed variance σ² = 0.5² for inference on Θ, and use the same prior as above. This is an exponential family with only one location parameter to be learned. Fitting q(Θ; f(x₁, …, x₅₀; ϕ)) is performed by minimizing either the negative ELBO or the expected forward KL divergence. The only differences between the two approaches are their objective functions: the data, network architecture, learning rates, and all other hyperparameters are the same. We optimize each objective function over 10,000 gradient steps, and measure the quality of the obtained variational approximations by a variety of metrics, including the (non-expected) forward KL divergence, the reverse KL divergence, the negative log-likelihood, and the angle point estimate. Intuition suggests that a global optimum should outperform local ones, and we find this to be the case. By any of the performance metrics we consider, the global solution of the expected forward KL minimization outperforms the variational approximations found by optimizing the ELBO.

Figure 7: Forward and reverse KL divergences to the true posterior across fitting, for minimization of the expected forward KL (blue) or the negative ELBO (green). We also plot the negative log-likelihood of the true angle, as well as the variational mode (the true angle θ_true is plotted in red).

6 Discussion

In this work, we showed that in the asymptotic limit of an infinitely wide neural network, gradient descent dynamics on the expected forward KL objective L_P converge to an ϵ-neighborhood of a unique function f*, a global minimizer. Our results depend on several regularity conditions, the most important of which are the positive definiteness of the limiting neural tangent kernel, the compactness of the data space X, and the specific architecture considered, a single hidden-layer ReLU network. We conjecture and show experimentally that global convergence holds more generally. First, we illustrate that the asymptotic regime describes practice with a finite number of neurons (see Section 5.1).
Second, our conditions allow for a wide variety of activation functions beyond ReLU (see Appendix D). Third, empirically, global convergence can be achieved with many network architectures. Beyond multilayer perceptrons, we illustrate global convergence for convolutional neural networks (CNNs) and permutation-invariant architectures in Section 5.2 and Section 5.3.

Expected forward KL minimization is a likelihood-free inference (LFI) method. For Bayesian inference, likelihood-based and likelihood-free methods are not typically viewed as competitors, but as different tools for different settings. In a setting where the likelihood is available, the prevailing wisdom suggests utilizing it. However, our results suggest that likelihood-free approaches to inference may be preferable even when the likelihood function is readily available. We find likelihood-based methods are prone to suffer the shortcomings of numerical optimization, often converging to shallow local optima of ELBO-based variational objectives. Expected forward KL minimization instead converges to a unique global optimum of its objective function.

There are situations in which ELBO optimization may nevertheless be preferable. First, if the likelihood function of the model is approximately convex and well-conditioned in the region of interest, ELBO optimization should recover a nearly global optimizer. Second, in certain situations, the ELBO can be optimized using deterministic optimization methods, which can be much faster than SGD. Third, if the generative model has free model parameters, with the ELBO they can be fitted while simultaneously fitting the variational approximation, with a single objective function for both tasks. Fourth, the ELBO can be applied with non-amortized variational distributions, which can have computational benefits in settings with few observations. Many important inference problems do not fall into any of these four categories. Even for those that do, the benefits of expected forward KL minimization, including global convergence, may outweigh the benefits of ELBO optimization.

Acknowledgments

We thank the reviewers for their helpful comments and suggestions. This material is based on work supported by the National Science Foundation under Grant Nos. 2209720 (OAC) and 2241144 (DGE), and the U.S. Department of Energy, Office of Science, Office of High Energy Physics under Award Number DE-SC0023714.

References

Altosaar, J., Ranganath, R., and Blei, D. (2018). Proximity variational inference. In International Conference on Artificial Intelligence and Statistics.

Ambrogioni, L., Güçlü, U., Berezutskaya, J., van den Borne, E., Güçlütürk, Y., Hinne, M., Maris, E., and van Gerven, M. (2019). Forward amortized inference for likelihood-free variational marginalization. In International Conference on Artificial Intelligence and Statistics.

Ba, J., Erdogdu, M., Suzuki, T., Wu, D., and Zhang, T. (2020). Generalization of two-layer neural networks: An asymptotic viewpoint. In International Conference on Learning Representations.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

Bornschein, J. and Bengio, Y. (2015). Reweighted wake-sleep. In International Conference on Learning Representations.

Burda, Y., Grosse, R. B., and Salakhutdinov, R. (2016). Importance weighted autoencoders. In International Conference on Learning Representations.

Carmeli, C., De Vito, E., and Toigo, A. (2006).
Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(4):377–408.

Carmeli, C., De Vito, E., Toigo, A., and Umanità, V. (2008). Vector valued reproducing kernel Hilbert spaces and universality.

Chizat, L., Oyallon, E., and Bach, F. (2019). On lazy training in differentiable programming. In Neural Information Processing Systems.

Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning.

Domke, J., Gower, R. M., and Garrigos, G. (2023). Provable convergence guarantees for black-box variational inference. In Neural Information Processing Systems.

Domke, J. and Sheldon, D. R. (2018). Importance weighting and variational inference. In Neural Information Processing Systems.

Dragomir, S. S. (2003). Some Gronwall Type Inequalities and Applications. Nova Science Publishers.

Fazelnia, G. and Paisley, J. (2018). CRVI: Convex relaxation for variational inference. In International Conference on Machine Learning.

Ghadimi, S. and Lan, G. (2015). Stochastic approximation methods and their finite-time convergence properties. In Handbook of Simulation Optimization, pages 149–178. Springer.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(40):1303–1347.

Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Neural Information Processing Systems.

Kim, K., Oh, J., Wu, K., Ma, Y., and Gardner, J. R. (2023). On the convergence of black-box variational inference. In Neural Information Processing Systems.

Le, T. A., Kosiorek, A. R., Siddharth, N., Teh, Y. W., and Wood, F. (2019). Revisiting reweighted wake-sleep for models with stochastic control flow. In International Conference on Uncertainty in Artificial Intelligence.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. (2019). Set Transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning.

Liu, C., Zhu, L., and Belkin, M. (2020). On the linearity of large non-linear models: when and why the tangent kernel is constant. In Neural Information Processing Systems.

Liu, R., McAuliffe, J. D., Regier, J., and the LSST Dark Energy Science Collaboration (2023). Variational inference for deblending crowded starfields. Journal of Machine Learning Research, 24(179):1–36.

Masrani, V., Le, T. A., and Wood, F. (2019). The thermodynamic variational objective. In Neural Information Processing Systems.

Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples.

Papamakarios, G. and Murray, I. (2016). Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Neural Information Processing Systems.

Papamakarios, G., Sterratt, D., and Murray, I. (2019). Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In International Conference on Artificial Intelligence and Statistics.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., et al. (2019). PyTorch: An imperative style, high-performance deep learning library.

Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In International Conference on Artificial Intelligence and Statistics.

Santambrogio, F. (2017). Euclidean, metric, and Wasserstein gradient flows: an overview. Bulletin of Mathematical Sciences, 7(1):87–154.

Shapiro, A. (2003).
Monte Carlo sampling methods. In Stochastic Programming, volume 10 of Handbooks in Operations Research and Management Science, pages 353–425. Elsevier.

Srivastava, M. K., Khan, A. H., and Srivastava, N. (2014). Statistical Inference: Theory of Estimation. PHI Learning.

Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.

Yang, J., Hu, W., and Li, C. J. (2021). On the fast convergence of random perturbations of the gradient flow. Asymptotic Analysis, 122:371–393.

Zhang, F. and Gao, C. (2020). Convergence rates of variational posterior distributions. The Annals of Statistics, 48(4):2180–2207.

A Convexity of the Functional Objective

We prove Lemma 1 from the manuscript below.

Proof. Let the log-density be given by log q(θ; η) = log h(θ) + η^⊤T(θ) − A(η). First, observe that under the conditions given, the function ℓ is equivalent (up to additive constants) to a much simpler expression, the expected negative log-density of q, via

    ℓ(x, η) = KL[ P(Θ | X = x) ‖ Q(Θ; η) ] = −E_{P(Θ|X=x)} log q(Θ; η) + C,

where C is a constant that does not depend on η. Now, the mapping η ↦ −log q(θ; η) is convex in η because its Hessian is ∇²_η A(η) = Var(T(θ)) ⪰ 0 (cf. Chapter 6.6.3 of Srivastava et al. (2014)). Strict convexity follows from minimality of the representation (cf. Wainwright and Jordan (2008), Proposition 3.1). We can show ℓ is (strictly) convex in η by applying linearity of expectation. We have for any λ ∈ [0, 1]

    ℓ(x, λη₁ + (1 − λ)η₂) = −E_{P(Θ|X=x)} log q(Θ; λη₁ + (1 − λ)η₂) + C   (9)
    ≤ −E_{P(Θ|X=x)} [ λ log q(Θ; η₁) + (1 − λ) log q(Θ; η₂) ] + C   (10)
    = λ ℓ(x, η₁) + (1 − λ) ℓ(x, η₂),   (11)

where the second line follows from the convexity of the map η ↦ −log q(θ; η) above for any value of θ. In fact, the inequality holds strictly as well by strict convexity above and the continuity of log q(θ; η) in θ. So the function ℓ(x, η) is strictly convex in η. ∎

B Unbiased Stochastic Gradients for the Parametric Objective

Computation of unbiased estimates of the gradient of the loss function L_P(ϕ) with respect to the parameters ϕ is all that is needed to implement SGD for L_P. Under mild conditions (see Proposition 1), the loss function L_P(ϕ) may be equivalently written as

    L_P(ϕ) = E_{P(X)} E_{P(Θ|X)} log[ p(Θ | X) / q(Θ; f(X; ϕ)) ] = E_{P(Θ,X)} log[ p(Θ | X) / q(Θ; f(X; ϕ)) ]

for density functions p, q, where f(·; ϕ) denotes a function parameterized by ϕ. Under the conditions of Proposition 1, differentiation and integration may be interchanged, so that

    ∇_ϕ L_P(ϕ) = E_{P(Θ,X)} ∇_ϕ log[ p(Θ | X) / q(Θ; f(X; ϕ)) ] = −E_{P(Θ,X)} ∇_ϕ log q(Θ; f(X; ϕ)),

and unbiased estimates of this quantity can be easily attained from samples (θ, x) ∼ P(Θ, X).

Proposition 1. Let (Ω₁, B₁), (Ω₂, B₂) be measurable spaces on which the random variables X : Ω₁ → X and Θ : Ω₂ → O are defined, respectively. Suppose that for all x ∈ X and all ϕ ∈ Φ we have P(Θ | X = x) ≪ Q(Θ; f(x; ϕ)) ≪ λ(Θ), with λ(Θ) denoting Lebesgue measure and ≪ denoting absolute continuity. Further, suppose that log[ p(Θ|X) / q(Θ; f(X; ϕ)) ] is measurable with respect to the product space (Ω₁ × Ω₂, B₁ ⊗ B₂) for each ϕ ∈ Φ, and ∇_ϕ log q(θ; f(x; ϕ)) exists for almost all (θ, x) ∈ O × X. Finally, assume there exists an integrable Y dominating ∇_ϕ log q(θ; f(x; ϕ)) for all ϕ ∈ Φ and almost all (θ, x) ∈ O × X. Then for any B ∈ ℕ and any ϕ ∈ Φ the quantity

    −(1/B) Σ_{i=1}^B ∇_ϕ log q(θ_i; f(x_i; ϕ)),   (θ_i, x_i) ∼iid P(Θ, X),   (12)

is an unbiased estimator of the gradient of the objective L_P, evaluated at ϕ ∈ Φ.
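A direct transcription of the estimator (12), assuming generic user-supplied callables sample_joint (an ancestral sampler for P(Θ, X)) and log_q (the variational log-density); both names are ours.

```python
import torch

def grad_estimate(sample_joint, log_q, net, B=64):
    """Unbiased estimate (12) of the gradient of L_P at the current weights phi of net."""
    theta, x = sample_joint(B)           # (theta_i, x_i) drawn iid from P(Theta, X)
    loss = -log_q(theta, net(x)).mean()  # -(1/B) sum_i log q(theta_i; f(x_i; phi))
    return torch.autograd.grad(loss, tuple(net.parameters()))
```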
Proof. By the absolute continuity assumptions, for any x ∈ X the distributions P(Θ | X = x) and Q(Θ; f(x; ϕ)) admit densities with respect to Lebesgue measure, denoted p(θ | x) and q(θ; f(x; ϕ)), respectively. We may then rewrite the KL divergence from Equation (1) as

    KL[ P(Θ | X = x) ‖ Q(Θ; f(x; ϕ)) ] := E_{P(Θ|X=x)} log[ dP(Θ | X = x) / dQ(Θ; f(x; ϕ)) ] = E_{P(Θ|X=x)} log[ p(Θ | x) / q(Θ; f(x; ϕ)) ]

because the Radon–Nikodym derivative dP/dQ is given by the ratio of these densities. Equation (1) is thus equivalent to

    E_{P(X)} E_{P(Θ|X)} log[ p(Θ | X) / q(Θ; f(X; ϕ)) ] = E_{P(Θ,X)} log[ p(Θ | X) / q(Θ; f(X; ϕ)) ].

This expectation is well-defined by the measurability assumption on log[ p(Θ|X) / q(Θ; f(X; ϕ)) ]. To interchange differentiation and integration, it suffices by Leibniz's rule that the gradient of this quantity with respect to ϕ is dominated by a measurable r.v. Y. More precisely, there exists an integrable Y(θ, x) defined on the product space O × X such that ‖∇_ϕ log[ p(θ|x) / q(θ; f(x; ϕ)) ]‖ ≤ Y(θ, x) for all ϕ ∈ Φ and almost everywhere-P(Θ, X). This is assumed in the statement of the proposition, and so we have

    ∇_ϕ E_{P(Θ,X)} log[ p(Θ | X) / q(Θ; f(X; ϕ)) ] = E_{P(Θ,X)} ∇_ϕ log[ p(Θ | X) / q(Θ; f(X; ϕ)) ] = −E_{P(Θ,X)} ∇_ϕ log q(Θ; f(X; ϕ)),

and the result follows by sampling. The variance of the gradient estimator can be reduced at the standard Monte Carlo rate, and for any B Equation (12) can be used for SGD. ∎

C The Limiting NTK

Before proceeding, we introduce the architecture specific to our analysis, a scaled two-layer network, and several theorems that we will use throughout the analysis. The first result, from Shapiro (2003), concerns optimization of the objective f(x) = E_{ξ∼P} F(x, ξ) in x via its empirical approximation f̂_n(x) = (1/n) Σ_{i=1}^n F(x, ξ_i), ξ_i ∼iid P. We reproduce this result below.

Theorem 2 (Proposition 7 of Shapiro (2003)). Let C be a nonempty compact subset of ℝ^n and suppose that (i) for almost every ξ ∈ Ξ the function F(·, ξ) is continuous on C, (ii) F(x, ξ), x ∈ C, is dominated by an integrable function, and (iii) the sample ξ₁, …, ξ_n is iid. Then the expected value function f(x) is finite-valued and continuous on C, and f̂_n(x) converges to f(x) with probability 1 uniformly on C.

The next two results are integral forms of Gronwall's inequality that we use in subsequent analysis. We refer to Dragomir (2003) for a detailed review and summarize the results therein below.

Theorem 3 (Gronwall's Inequality, Corollary 3 of Dragomir (2003)). Let u(t) ∈ ℝ be such that u(t) ≤ c₁ + c₂ ∫₀ᵗ u(s) ds for t > 0 and nonnegative c₁, c₂. Then u(t) ≤ c₁ exp[c₂ t].

Theorem 4 (Theorem 57 of Dragomir (2003)). Let u(t) ∈ ℝ be such that u(t) ≤ c₁ + c₂ ∫₀ᵗ ∫₀ˢ u(v) dv ds for t > 0 and nonnegative c₁, c₂. Then u(t) ≤ c₁ exp[c₂ t²/2].

Now we turn to specifics of the architecture we consider. Assume the function f has the architecture of a (scaled) two-layer (single hidden layer) network mapping f : X → Y with X ⊆ ℝ^d and Y ⊆ ℝ^q. We consider this network architecture for a given width p, and study each of the i = 1, …, q coordinate functions of f. For a scaled two-layer network, the ith such function is

    f(x; ϕ)_i := (1/√p) Σ_{j=1}^p a_ij σ(x^⊤ w_j),   i = 1, …, q,

where σ denotes an activation function. The scaling depends on the width of the network p. The parameters ϕ are thus ϕ = {w_j, a_(·),j}_{j=1}^p, where a_(·),j denotes the vector [a_{1j}, …, a_{qj}]^⊤ (i.e., the jth coefficient for each component function i). The individual parameters have dimensions as follows: w_j ∈ ℝ^d and a_(·),j ∈ ℝ^q for all j = 1, …, p, where again p denotes the network width and d the data dimension dim X.
For ease hereafter, we write a_j = a_(·),j to refer to the entire jth vector of second-layer network coefficients when the context is clear. As is standard, the first-layer parameters are initialized as independent standard Gaussian random variables, i.e., w_j ∼iid N(0, I_d) for all j = 1, …, p. In other related works, the weight a_ij is sometimes also drawn as a_ij ∼iid N(0, 1) for all i = 1, …, q, j = 1, …, p, but in this work, we initialize these second-layer weights to zero for simplicity to ensure that at initialization, f(·; ϕ) ≡ 0. A zero-initialized network function is used for analysis in several related works, e.g., Chizat et al. (2019) and Ba et al. (2020). For now, notationally we denote weights to be initialized as draws from an arbitrary distribution D, and we introduce specificity to D as required.

The neural tangent kernel (Equation (4)) can be computed explicitly for this architecture and is given in the lemma below, which proves pointwise convergence to the limiting NTK at initialization as the width p tends to infinity.

Lemma 3 (Pointwise Convergence at Initialization). For the architecture above, consider any p. Let a ∈ ℝ^q, w ∈ ℝ^d be distributed according to a, w ∼ D for some distribution D such that a, w are integrable (L¹) random variables. Assume X is compact and σ is bounded. Then, provided condition (C4) holds (see below), we have for any x, x̃ ∈ X that

    K^p_{ϕ(0)}(x, x̃) →a.s. E_D K(x, x̃; w, a)   (13)

as p → ∞, where K^p_{ϕ(0)} denotes the NTK at initialization constructed from draws a_j, w_j ∼iid D, and K(x, x̃; w, a) ∈ ℝ^{q×q} is the q × q matrix whose (k, l)th entry is given by

    1_{k=l} σ(x^⊤w) σ(x̃^⊤w) + a_k a_l σ′(x^⊤w) σ′(x̃^⊤w) x^⊤x̃,   k, l = 1, …, q.

Proof. Consider the kth coordinate function of f. For any choice of p, under an arbitrary ordering of the parameters, the gradient is given by

    ∇_ϕ f_k(x; ϕ) = (1/√p) [ σ(x^⊤w₁), …, σ(x^⊤w_p), a_{k1} σ′(x^⊤w₁) x^⊤, …, a_{kp} σ′(x^⊤w_p) x^⊤ ]^⊤,

where the first block collects the partial derivatives ∂f_k/∂a_{kj} and the second block the partial derivatives with respect to w₁, …, w_p. In the above, we omitted the partial derivatives ∂f_k/∂a_{lj} for l ≠ k, j = 1, …, p, because these are all identically zero. From this, it follows that for any fixed x, x̃ ∈ X, the (k, l)th entry of K^p_{ϕ(0)}(x, x̃) is given by

    ∇_ϕ f_k(x; ϕ(0))^⊤ ∇_ϕ f_l(x̃; ϕ(0)) = (1/p) Σ_{j=1}^p 1_{k=l} σ(x^⊤w_j) σ(x̃^⊤w_j) + (1/p) Σ_{j=1}^p a_{kj} a_{lj} σ′(x^⊤w_j) σ′(x̃^⊤w_j) x^⊤x̃.

The existence of the limiting NTK follows immediately: each of the two terms above is integrable by the compactness of X and domination (see (C4)). It follows that K^∞(x, x̃) is the q × q matrix whose (k, l)th entry is given by

    E_{w,a∼D} [ 1_{k=l} σ(x^⊤w) σ(x̃^⊤w) + a_k a_l σ′(x^⊤w) σ′(x̃^⊤w) x^⊤x̃ ]

with w, a ∼ D. Convergence in probability pointwise follows from the weak law of large numbers, and almost sure convergence holds by the strong law of large numbers. K(x, x̃; a, w) is integrable by assumption (C4) (see below), so the expectation is well-defined. ∎

The proof of the existence of and pointwise convergence to the limiting NTK K^∞ above is rather straightforward, and this result has been previously established in other works (Jacot et al., 2018).
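As an aside (ours, not from the paper), the finite-width NTK of Equation (4) can be computed directly for any architecture with ℝ^q outputs by assembling per-parameter Jacobians; a torch.func sketch:

```python
import torch

def empirical_ntk(net, params, x1, x2):
    """K^p_phi(x1, x2) = J_phi f(x1; phi) J_phi f(x2; phi)^T, a q x q matrix."""
    def fx(p, x):
        return torch.func.functional_call(net, p, (x.unsqueeze(0),)).squeeze(0)
    j1 = torch.func.jacrev(fx)(params, x1)  # dict: param name -> (q, *param_shape)
    j2 = torch.func.jacrev(fx)(params, x2)
    return sum(j1[k].reshape(j1[k].shape[0], -1) @ j2[k].reshape(j2[k].shape[0], -1).T
               for k in j1)

net = torch.nn.Sequential(torch.nn.Linear(2, 512), torch.nn.ReLU(), torch.nn.Linear(512, 3))
params = dict(net.named_parameters())
K = empirical_ntk(net, params, torch.randn(2), torch.randn(2))  # 3 x 3
```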
For our analysis of kernel gradient flows in Theorem 1 for the expected forward KL objectives L_P and L_F, however, we require uniform convergence to K^∞ over the entire data space X. We establish conditions under which this uniform convergence holds in two results, Proposition 2 and Proposition 3. Proposition 2, given below, concerns convergence at initialization to the limiting neural tangent kernel K^∞ (i.e., before beginning gradient descent). Proposition 3, proven in Appendix D, demonstrates that across a finite training interval [0, T], the NTK changes minimally from its initial value in a large-width regime. Generally, we refer to the first result as deterministic initialization and the second as lazy training, following related works (Jacot et al., 2018; Chizat et al., 2019). Below, we give suitable regularity conditions and state and prove Proposition 2.

(C1) The data space X is compact.

(C2) The distribution D is such that w ∼ N(0, I_d) and a = 0 w.p. 1. For j = 1, …, p iid draws from this distribution, we thus have w_j ∼iid N(0, I_d) and a_ij = 0 w.p. 1 for all i, j.

(C3) The activation function σ is continuous. Under (C2), this implies that the function K(·, ·; a, w) from Lemma 3 with a, w ∼ D is almost surely continuous.

(C4) The function K(x, x̃; a, w) is dominated by some integrable random variable G, i.e., for all (x, x̃) ∈ X × X we have ‖K(x, x̃; a, w)‖_F ≤ G(a, w) almost surely for integrable G(a, w).

Proposition 2. Fix a scaled two-layer network architecture of width p, and let Φ denote the corresponding parameter space. Initialize ϕ(0) as independent, identically distributed random variables drawn from the distribution D in (C2). Let K^p_{ϕ(0)} : X × X → ℝ^{q×q} be the mapping defined by (x, x̃) ↦ K^p_{ϕ(0)}(x, x̃) = J_ϕ f(x; ϕ(0)) J_ϕ f(x̃; ϕ(0))^⊤. Then, provided conditions (C1)–(C4) hold, we have as p → ∞ that

    sup_{x,x̃ ∈ X} ‖K^p_{ϕ(0)}(x, x̃) − K^∞(x, x̃)‖_F →a.s. 0,   (DI)

where K^∞(x, x̃) := plim_{p→∞} K^p_{ϕ(0)}(x, x̃) is a fixed, continuous kernel.

Proof. The proof follows by direct application of Proposition 7 of Shapiro (2003). Precisely, we satisfy (i) almost-sure continuity of K(·, ·; a, w) by (C3), (ii) domination by (C4), and (iii) the draws comprising K^p_{ϕ(0)} are iid by assumption. By this proposition, then, we have uniform convergence of K^p_{ϕ(0)} to K^∞, and obtain continuity of K^∞ as well. ∎

D Lazy Training

Below, we prove several results that will aid in proving the lazy training result of Proposition 3 (see below). Given the same architecture as above in Appendix C and a fixed width p and time T > 0, we will begin by bounding ‖w_j(T) − w_j(0)‖ and ‖a_kj(T) − a_kj(0)‖, ‖a_lj(T) − a_lj(0)‖ for all k, l = 1, …, q and all j = 1, …, p. As in Appendix C, there are several conditions that we impose and use in the following results. (D1)–(D2) are identical to (C1)–(C2), repeated for clarity.

(D1) The data space X is compact.

(D2) The distribution D is such that w ∼ N(0, I_d) and a = 0 w.p. 1. For j = 1, …, p iid draws from this distribution, we thus have w_j ∼iid N(0, I_d) and a_ij = 0 w.p. 1 for all i, j.

(D3) The function ℓ(x, η) = KL[ P(Θ | X = x) ‖ Q(Θ; η) ] is such that ℓ′(x, η) is bounded uniformly for all x and for all η ∈ {f(x; ϕ(t)) : t > 0} by a constant M, uniformly over the width p. We recall that this notation is shorthand for ∇_η ℓ(x, η).

(D4) The activation function σ(·) is non-polynomial and is Lipschitz with constant C. Note that the Lipschitz condition implies σ has bounded first derivative, i.e., |σ′(r)| ≤ C for all r ∈ ℝ.

With these conditions in hand, we now prove several lemmas for individual parameters.

Lemma 4 (Lazy Training of w). For the width-p scaled two-layer architecture above, assume that conditions (D1)–(D4) hold. Let ϕ evolve according to the gradient flow of the objective L_P, i.e., ϕ̇(t) = −∇_ϕ L_P(ϕ). Fix any T > 0. Then for all j = 1, …, p,

    ‖w_j(T) − w_j(0)‖₂ ≤ ‖w_j(0)‖₂ D_{p,T} + E_{p,T}   (14)

almost surely, where D_{p,T}, E_{p,T} are constants depending on p, T, and lim_{p→∞} D_{p,T} = 0 and lim_{p→∞} E_{p,T} = 0.

Proof.
First note that for any fixed j, we have

    J_{w_j} f(x; ϕ) = [ ∇_{w_j} f₁(x; ϕ), …, ∇_{w_j} f_q(x; ϕ) ]^⊤ = (1/√p) [ a_{1j} σ′(x^⊤w_j) x, …, a_{qj} σ′(x^⊤w_j) x ]^⊤,

as required, where a_ij ∈ ℝ for i = 1, …, q and x ∈ X ⊆ ℝ^d from (D1). We can bound the operator 2-norm of this matrix by observing that for any y ∈ ℝ^d we have

    ‖J_{w_j} f(x; ϕ) y‖₂² = (1/p) σ′(x^⊤w_j)² (x^⊤y)² ‖a_j‖₂² ≤ (C²/p) ‖a_j‖₂² ‖x‖₂² ‖y‖₂²

by (D4) and Cauchy–Schwarz, so that ‖J_{w_j} f(x; ϕ)‖₂ ≤ (C/√p) ‖a_j‖₂ by observing that ‖x‖₂² is bounded by (D1) (we absorb this term into the constant C). By similar computations, we also have J_{a_j} f(x; ϕ) = (σ(x^⊤w_j)/√p) I_q. Using condition (D4), it follows that

    ‖J_{a_j} f(x; ϕ)‖₂ = |σ(x^⊤w_j)|/√p ≤ ( |σ(0)| + C|x^⊤w_j| )/√p =: ( K + C|x^⊤w_j| )/√p ≤ ( K + C‖w_j‖₂ )/√p

by Cauchy–Schwarz and (D1), (D4), where throughout the following we let K := |σ(0)| and again ‖x‖ terms are absorbed into the constant C. Now we will use these computations to bound the variation of w_j across the interval (0, T]. Fix any t ∈ (0, T]. Then we have

    ‖w_j(t) − w_j(0)‖₂ ≤ ∫₀ᵗ ‖ẇ_j(s)‖ ds
    ≤ ∫₀ᵗ E_{P(X)} ‖J_{w_j} f(X; ϕ(s))^⊤ ℓ′(X, f(X; ϕ(s)))‖₂ ds
    ≤ M ∫₀ᵗ E_{P(X)} ‖J_{w_j} f(X; ϕ(s))‖₂ ds   [by (D3)]
    ≤ (CM/√p) ∫₀ᵗ ‖a_j(s)‖₂ ds   [by the above work]
    = (CM/√p) ∫₀ᵗ ‖a_j(s) − a_j(0)‖₂ ds   [a.s., by (D2)]
    ≤ (CM/√p) ∫₀ᵗ ∫₀ˢ ‖ȧ_j(v)‖₂ dv ds
    ≤ (CM/√p) ∫₀ᵗ ∫₀ˢ E_{P(X)} ‖J_{a_j} f(X; ϕ(v))‖₂ ‖ℓ′(X, f(X; ϕ(v)))‖₂ dv ds
    ≤ (CM²/√p) ∫₀ᵗ ∫₀ˢ ( K + C‖w_j(v)‖₂ )/√p dv ds   [by (D3) and the above work]
    ≤ (CKM²T²)/(2p) + (C²M²/p) ∫₀ᵗ ∫₀ˢ ( ‖w_j(v) − w_j(0)‖₂ + ‖w_j(0)‖₂ ) dv ds
    ≤ (CKM²T²)/(2p) + (C²M²T²)/(2p) ‖w_j(0)‖₂ + (C²M²/p) ∫₀ᵗ ∫₀ˢ ‖w_j(v) − w_j(0)‖₂ dv ds
    = c₁ + c₂ ∫₀ᵗ ∫₀ˢ ‖w_j(v) − w_j(0)‖₂ dv ds

with c₁ = (CKM²T²)/(2p) + (C²M²T²)/(2p) ‖w_j(0)‖₂ and c₂ = C²M²/p. Note that even though c₁ depends on T, it is constant because T is fixed. We write these quantities in this way to recognize a Gronwall-type inequality that we can use to bound the left-hand side. Indeed, by direct application of Theorem 57 of Dragomir (2003) (see Theorem 4), we have that

    ‖w_j(t) − w_j(0)‖₂ ≤ c₁ exp[c₂ t²/2] ≤ ( (CKM²T²)/(2p) + (C²M²T²)/(2p) ‖w_j(0)‖₂ ) exp[ C²M²T²/(2p) ],

giving the result for t = T if we take D_{p,T} = (C²M²T²)/(2p) · exp[C²M²T²/(2p)] and E_{p,T} = (CKM²T²)/(2p) · exp[C²M²T²/(2p)]. Clearly these constants satisfy lim_{p→∞} D_{p,T} = 0 and lim_{p→∞} E_{p,T} = 0 for any fixed T. ∎

Lemma 5 (Lazy Training of a). Under the same conditions as Lemma 4, let ϕ evolve according to the gradient flow of the objective L_P, i.e., ϕ̇(t) = −∇_ϕ L_P(ϕ). Fix any T > 0. Then we have for any j that

    ‖a_j(T)‖₂ ≤ ‖w_j(0)‖₂ F_{p,T} + G_{p,T}   (15)

almost surely, where F_{p,T} and G_{p,T} are constants depending on p, T satisfying lim_{p→∞} F_{p,T} = 0 and lim_{p→∞} G_{p,T} = 0.

Proof. We will use much of the same work as in Lemma 4. Namely, ‖a_j(t)‖₂ = ‖a_j(t) − a_j(0)‖₂ almost surely by (D2), and thereafter for any t ∈ (0, T] we have

    ‖a_j(t) − a_j(0)‖₂ ≤ ∫₀ᵗ ‖ȧ_j(s)‖₂ ds
    ≤ ∫₀ᵗ E_{P(X)} ‖J_{a_j} f(X; ϕ(s))‖₂ ‖ℓ′(X, f(X; ϕ(s)))‖₂ ds
    ≤ M ∫₀ᵗ E_{P(X)} ‖J_{a_j} f(X; ϕ(s))‖₂ ds
    ≤ KMt/√p + (MC/√p) ∫₀ᵗ ‖w_j(s)‖₂ ds   [by the work in Lemma 4]
    ≤ KMt/√p + (MC/√p) ∫₀ᵗ ( ‖w_j(s) − w_j(0)‖₂ + ‖w_j(0)‖₂ ) ds
    ≤ KMt/√p + (MCt/√p) ‖w_j(0)‖₂ + (MC/√p) ∫₀ᵗ ( D_{p,s} ‖w_j(0)‖₂ + E_{p,s} ) ds   [by Lemma 4]
    = ‖w_j(0)‖₂ ( MCt/√p + (MC/√p) ∫₀ᵗ D_{p,s} ds ) + ( KMt/√p + (MC/√p) ∫₀ᵗ E_{p,s} ds )
    =: ‖w_j(0)‖₂ F_{p,t} + G_{p,t}.

Clearly, these constants satisfy lim_{p→∞} F_{p,t} = 0 and lim_{p→∞} G_{p,t} = 0 (to see this, simply plug in the forms of D_{p,s} and E_{p,s} from Lemma 4 above), and we have the result by taking t = T. ∎

Now with these results in hand, we may state and prove Proposition 3, given below.

Proposition 3. Under the same conditions as Proposition 2, fix any T > 0. For any t ∈ (0, T], let K^p_{ϕ(t)} : X × X → ℝ^{q×q} be the mapping defined by (x, x̃) ↦ K^p_{ϕ(t)}(x, x̃) = J_ϕ f(x; ϕ(t)) J_ϕ f(x̃; ϕ(t))^⊤. Then, provided conditions (D1)–(D4) hold, we have as p → ∞ that

    sup_{x,x̃ ∈ X, t ∈ (0,T]} ‖K^p_{ϕ(t)}(x, x̃) − K^p_{ϕ(0)}(x, x̃)‖_F →a.s. 0.   (LT)
Proof. Let us examine the $(k, l)$-th entry of the $q \times q$ matrix $K^p_{\phi(t)}(x, \tilde{x}) - K^p_{\phi(0)}(x, \tilde{x})$ for fixed $x, \tilde{x}$ and some $t \in (0, T]$. By the work in Appendix C, the $(k, l)$-th entry is given by
\[ \frac{\delta_{kl}}{p} \sum_{j=1}^p \Big[ \sigma(x^\top w_j(t)) \, \sigma(\tilde{x}^\top w_j(t)) - \sigma(x^\top w_j(0)) \, \sigma(\tilde{x}^\top w_j(0)) \Big] \tag{16-17} \]
\[ + \; \frac{1}{p} \sum_{j=1}^p \Big[ a_{kj}(t) \, a_{lj}(t) \, \sigma'(x^\top w_j(t)) \, \sigma'(\tilde{x}^\top w_j(t)) - a_{kj}(0) \, a_{lj}(0) \, \sigma'(x^\top w_j(0)) \, \sigma'(\tilde{x}^\top w_j(0)) \Big] \, x^\top \tilde{x}. \tag{18-19} \]
Above, we have made explicit the dependence of the parameters on time, e.g., $w_j(t)$ vs. $w_j(0)$. We aim to show that this quantity tends to zero as $p \to \infty$. We first prove that this holds pointwise, considering the first term (16-17) and the second term (18-19) one at a time for fixed $x, \tilde{x}$.

First consider the $j$-th summand of the first term; we bound its absolute value. If $k \neq l$, this term vanishes (by the factor $\delta_{kl}$) and we are done, so assume $k = l$. We have for any $j$ that
\[
\begin{aligned}
&\big| \sigma(x^\top w_j(t)) \, \sigma(\tilde{x}^\top w_j(t)) - \sigma(x^\top w_j(0)) \, \sigma(\tilde{x}^\top w_j(0)) \big| \\
&\quad \le \big| \sigma(x^\top w_j(t)) \big| \, \big| \sigma(\tilde{x}^\top w_j(t)) - \sigma(\tilde{x}^\top w_j(0)) \big| + \big| \sigma(\tilde{x}^\top w_j(0)) \big| \, \big| \sigma(x^\top w_j(t)) - \sigma(x^\top w_j(0)) \big|,
\end{aligned}
\]
and by the Lipschitz assumption on $\sigma(\cdot)$ and Cauchy-Schwarz, we can bound the quantity above as follows:
\[
\begin{aligned}
&\big( K + C \|x\|_2 \|w_j(t)\|_2 \big) \, C \|\tilde{x}\|_2 \|w_j(t) - w_j(0)\|_2 + \big( K + C \|\tilde{x}\|_2 \|w_j(0)\|_2 \big) \, C \|x\|_2 \|w_j(t) - w_j(0)\|_2 \\
&\quad \le C \|w_j(t) - w_j(0)\|_2 \big( 2K + C \|w_j(t)\|_2 + C \|w_j(0)\|_2 \big) && \text{since } \|x\|_2, \|\tilde{x}\|_2 \text{ are bounded by (D1)} \\
&\quad \le C \|w_j(t) - w_j(0)\|_2 \big( 2K + C \|w_j(t) - w_j(0)\|_2 + 2C \|w_j(0)\|_2 \big) && \text{by the triangle inequality} \\
&\quad = 2CK \|w_j(t) - w_j(0)\|_2 + C^2 \|w_j(t) - w_j(0)\|_2^2 + 2C^2 \|w_j(0)\|_2 \, \|w_j(t) - w_j(0)\|_2,
\end{aligned}
\]
where again $\|x\|_2$ has been absorbed into the constant $C$ by (D1). Using Lemma 4, we can bound all terms above almost surely in terms of $\|w_j(0)\|_2$:
\[
\begin{aligned}
&2CK \big( D_{p,t} \|w_j(0)\|_2 + E_{p,t} \big) + C^2 \big( D_{p,t} \|w_j(0)\|_2 + E_{p,t} \big)^2 + 2C^2 \|w_j(0)\|_2 \big( D_{p,t} \|w_j(0)\|_2 + E_{p,t} \big) \\
&\quad = \big( 2C^2 D_{p,t} + C^2 D_{p,t}^2 \big) \|w_j(0)\|_2^2 + \big( 2CK D_{p,t} + 2C^2 D_{p,t} E_{p,t} + 2C^2 E_{p,t} \big) \|w_j(0)\|_2 + \big( 2CK E_{p,t} + C^2 E_{p,t}^2 \big).
\end{aligned}
\]
Recalling that $w_j(0) \overset{\text{iid}}{\sim} N(0, I_d)$, the quantities $\|w_j(0)\|_2$ and $\|w_j(0)\|_2^2$ are integrable, with expectations denoted $\mu$ and $\nu$, respectively. Altogether, this work shows that
\[
\begin{aligned}
&\frac{1}{p} \sum_{j=1}^p \big| \sigma(x^\top w_j(t)) \, \sigma(\tilde{x}^\top w_j(t)) - \sigma(x^\top w_j(0)) \, \sigma(\tilde{x}^\top w_j(0)) \big| \\
&\quad \le \frac{1}{p} \sum_{j=1}^p \Big[ \big( 2C^2 D_{p,t} + C^2 D_{p,t}^2 \big) \|w_j(0)\|_2^2 + \big( 2CK D_{p,t} + 2C^2 D_{p,t} E_{p,t} + 2C^2 E_{p,t} \big) \|w_j(0)\|_2 + \big( 2CK E_{p,t} + C^2 E_{p,t}^2 \big) \Big] \\
&\quad \overset{\text{a.s.}}{\longrightarrow} \lim_{p} \big( 2C^2 D_{p,t} + C^2 D_{p,t}^2 \big) \, \nu + \lim_{p} \big( 2CK D_{p,t} + 2C^2 D_{p,t} E_{p,t} + 2C^2 E_{p,t} \big) \, \mu + \lim_{p} \big( 2CK E_{p,t} + C^2 E_{p,t}^2 \big) = 0
\end{aligned}
\]
by the conditions on $D_{p,t}$ and $E_{p,t}$ from Lemma 4, the strong law of large numbers, and the classical fact from analysis that $\lim_n a_n b_n = (\lim_n a_n)(\lim_n b_n)$, provided both limits on the right-hand side exist.

Lastly, we can achieve the same result for the second term quickly. Because $a_{ij}(0) = 0$ w.p. 1 by (D2), the $t = 0$ portion of (18-19) vanishes almost surely, and
\[
\begin{aligned}
\bigg| \frac{1}{p} \sum_{j=1}^p a_{kj}(t) \, a_{lj}(t) \, \sigma'(x^\top w_j(t)) \, \sigma'(\tilde{x}^\top w_j(t)) \, x^\top \tilde{x} \bigg|
&\le \frac{1}{p} \sum_{j=1}^p |a_{kj}(t)| \, |a_{lj}(t)| \, |\sigma'(x^\top w_j(t))| \, |\sigma'(\tilde{x}^\top w_j(t))| \, \|x\|_2 \|\tilde{x}\|_2 \\
&\le \frac{C^2}{p} \sum_{j=1}^p |a_{kj}(t)| \, |a_{lj}(t)| \le \frac{C^2}{p} \sum_{j=1}^p \|a_j(t)\|_2^2,
\end{aligned}
\]
because for all $j$, $|a_{kj}|$ and $|a_{lj}|$ are each dominated by $\|a_j\|_2$ (and $|\sigma'|$, $\|x\|_2$, $\|\tilde{x}\|_2$ are bounded by (D4), (D1) and absorbed into $C$). From here, by Lemma 5 we can bound each term in the sum above by
\[ \big( \|w_j(0)\|_2 \, F_{p,t} + G_{p,t} \big)^2 = F_{p,t}^2 \, \|w_j(0)\|_2^2 + 2 F_{p,t} G_{p,t} \, \|w_j(0)\|_2 + G_{p,t}^2, \]
and the average tends to zero almost surely as $p \to \infty$ by logic similar to the above.

Together, these results combine to show that $| K^p_{\phi(t)}(x, \tilde{x})_{kl} - K^p_{\phi(0)}(x, \tilde{x})_{kl} | \overset{\text{a.s.}}{\to} 0$ as $p \to \infty$. As $k, l \in \{1, \dots, q\}$ were arbitrary, we have $\| K^p_{\phi(t)}(x, \tilde{x}) - K^p_{\phi(0)}(x, \tilde{x}) \|_F \overset{\text{a.s.}}{\to} 0$. This establishes pointwise convergence for a fixed $t \in (0, T]$.

Uniform convergence over all of $\mathcal{X} \times \mathcal{X}$ and all $t \in (0, T]$ follows easily in this case. Firstly, the quantities $D_{p,t}$, $E_{p,t}$, $F_{p,t}$, and $G_{p,t}$ are monotonic in $t$, so we may bound uniformly over $t \in (0, T]$ by taking $t = T$ in the expressions above. Secondly, our bounds on the two terms were independent of the choice of point $(x, \tilde{x})$. More precisely, $\sup_{x, \tilde{x} \in \mathcal{X}, \, t \in (0, T]} \| K^p_{\phi(t)}(x, \tilde{x}) - K^p_{\phi(0)}(x, \tilde{x}) \|_F$ can be bounded above, up to a fixed factor accounting for the $q \times q$ entries, by
\[
\begin{aligned}
&\frac{1}{p} \sum_{j=1}^p \Big[ \big( 2C^2 D_{p,T} + C^2 D_{p,T}^2 \big) \|w_j(0)\|_2^2 + \big( 2CK D_{p,T} + 2C^2 D_{p,T} E_{p,T} + 2C^2 E_{p,T} \big) \|w_j(0)\|_2 + \big( 2CK E_{p,T} + C^2 E_{p,T}^2 \big) \Big] \\
&\quad + \frac{C^2}{p} \sum_{j=1}^p \Big[ F_{p,T}^2 \, \|w_j(0)\|_2^2 + 2 F_{p,T} G_{p,T} \, \|w_j(0)\|_2 + G_{p,T}^2 \Big],
\end{aligned}
\]
which tends to zero almost surely by the same work as above. $\square$
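The following finite-width check (again a sketch under assumptions of our choosing, with a generic squared-error objective standing in for $\ell$ and hypothetical dimensions, not the paper's experiment code) illustrates the lazy-training behavior of Lemma 4: the maximal displacement $\max_j \|w_j(T) - w_j(0)\|_2$ shrinks as the width $p$ grows.

```python
import torch

torch.manual_seed(0)
d, q, steps = 2, 3, 500
xs = torch.randn(128, d)       # stand-in draws from P(X)
targets = torch.randn(128, q)  # generic targets; squared error stands in for l

for p in [64, 256, 1024, 4096]:
    W = torch.randn(p, d, requires_grad=True)  # w_j ~ N(0, I_d), as in (D2)
    A = torch.zeros(q, p, requires_grad=True)  # a = 0 w.p. 1, as in (D2)
    W0 = W.detach().clone()
    opt = torch.optim.SGD([W, A], lr=0.1)
    for _ in range(steps):
        f = torch.relu(xs @ W.T) @ A.T / p ** 0.5  # scaled two-layer network
        loss = ((f - targets) ** 2).mean()         # surrogate objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    drift = (W.detach() - W0).norm(dim=1).max().item()
    print(f"p = {p:5d}   max_j ||w_j(T) - w_j(0)||_2 = {drift:.4f}")
```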
E Kernel Gradient Flow Analysis

We rely on additional regularity conditions outlined below. We will consider the following three flows in our proof of Theorem 1 (for some choice of $p$):
\[ \dot f_t(x) = -\mathbb{E}_{P(X)} \, K^p_{\phi(t)}(x, X) \, \ell'(X, f_t(X)) \tag{20} \]
\[ \dot g_t(x) = -\mathbb{E}_{P(X)} \, K_\infty(x, X) \, \ell'(X, g_t(X)) \tag{21} \]
\[ \dot h_t(x) = -\mathbb{E}_{P(X)} \, K^p_{\phi(0)}(x, X) \, \ell'(X, h_t(X)) \tag{22} \]
where $f_t$ is shorthand for $f(\cdot\,; \phi(t))$. The three flows above can be thought of as corresponding to $\mathcal{L}_P$, $\mathcal{L}_F$, and a lazy variant, respectively. The flow of $h_t$ is lazy because it follows the dynamics of a fixed kernel, the kernel at initialization. The flow of $g_t$ also follows a fixed kernel, but the limiting NTK $K_\infty$ instead. The flow of $f_t$ is the one obtained in practice, where the kernel $K^p_{\phi(t)}$ changes continuously as the parameters $\phi(t)$ evolve in time. The flow of $h_t$ is used to bound the differences between $f_t$ and $g_t$ in the proof of Theorem 1. We now enumerate the regularity conditions.

(E1) The functional $\mathcal{L}_F(f)$ satisfies $\inf_f \mathcal{L}_F(f) > -\infty$.

(E2) The limiting NTK $K_\infty$ is positive definite (so that the RKHS $\mathcal{H}$ with kernel $K_\infty$ is well-defined).

(E3) Under (E1) and (E2), the function $f^*$ minimizing $\mathcal{L}_F$ satisfies $\|f^*\|_{\mathcal{H}} < \infty$, so that $f^* \in \mathcal{H}$.

(E4) For any choice of $p$, the quantities $\ell'(x, f_t(x))$, $\ell'(x, g_t(x))$, and $\ell'(x, h_t(x))$ are bounded by a constant $M$ for all $t, x$.

(E5) The function $\ell$ is $L$-smooth in its second argument, i.e., $\|\ell'(x, \eta_1) - \ell'(x, \eta_2)\| \le L \|\eta_1 - \eta_2\|$.

We first prove Lemma 2 from the manuscript.

Proof. Let $f^* \in \operatorname{argmin} \mathcal{L}_F(f)$, where $\mathcal{L}_F(f)$ is the functional objective; hereafter $\mathcal{L}$ denotes $\mathcal{L}_F$. Then $\mathcal{L}(f^*) > -\infty$ by (E1). If $f_t$ evolves according to the kernel gradient flow (21) above (i.e., the flow with kernel $K_\infty$), we have, from the chain rule for Fréchet derivatives, that
\[ \frac{d}{dt} \mathcal{L}(f_t) = \mathcal{L}'_{f_t}\!\left( \frac{\partial f_t}{\partial t} \right). \]
By definition, $\frac{\partial f_t}{\partial t} = \dot f_t$. We also have $\mathcal{L}'_{f_t} : h \mapsto \mathbb{E}_{P(X)} \, \ell'(X, f_t(X))^\top h(X)$. Plugging this in yields
\[ \frac{d}{dt} \mathcal{L}(f_t) = -\mathbb{E}_{X \sim P(X)} \, \ell'(X, f_t(X))^\top \, \mathbb{E}_{X' \sim P(X)} \, K_\infty(X, X') \, \ell'(X', f_t(X')) = -\mathbb{E}_{X, X' \sim P} \, \ell'(X, f_t(X))^\top K_\infty(X, X') \, \ell'(X', f_t(X')) \le 0 \]
by the positive definiteness of the kernel $K_\infty$ (from (E2)). Now define $\Delta_t = \frac{1}{2} \|f_t - f^*\|^2_{\mathcal{H}}$, where $\mathcal{H}$ is the vector-valued reproducing kernel Hilbert space corresponding to the kernel $K_\infty$ (see Carmeli et al. (2006) for a detailed review). We let $\langle \cdot, \cdot \rangle$ denote the inner product on the space $\mathcal{H}$. It follows that $\Delta'_t : h \mapsto \langle f_t - f^*, h \rangle$. Then, by the chain rule,
\[
\begin{aligned}
\dot\Delta_t &= \langle f_t - f^*, \dot f_t \rangle = \big\langle f_t - f^*, \, -\mathbb{E}_{P(X)} \, K_\infty(\cdot, X) \, \ell'(X, f_t(X)) \big\rangle \\
&= -\mathbb{E}_{P(X)} \, \ell'(X, f_t(X))^\top \big[ f_t(X) - f^*(X) \big] \\
&\le -\mathbb{E}_{P(X)} \big[ \ell(X, f_t(X)) - \ell(X, f^*(X)) \big] = -\big[ \mathcal{L}(f_t) - \mathcal{L}(f^*) \big].
\end{aligned}
\]
To go from the first line to the second, we used the reproducing property of the vector-valued kernel, the definition of the inner product, and the linearity of integration. More precisely, the reproducing property (cf. Eq. (2.2) of Carmeli et al. (2006)) tells us that for any functions $g, h$ and fixed $x$, $\langle g, K_\infty(\cdot, x) h(x) \rangle = g(x)^\top h(x)$, and so the second line results from the first by exchanging the integral and the inner product. In the final line, we used the convexity of $\ell$ in its second argument (from Lemma 1 of the manuscript). Now consider the Lyapunov functional given by
\[ E(t) = t \big[ \mathcal{L}(f_t) - \mathcal{L}(f^*) \big] + \Delta_t. \tag{23} \]
Differentiating, we have
\[ \dot E(t) = \big[ \mathcal{L}(f_t) - \mathcal{L}(f^*) \big] + t \, \frac{d}{dt} \mathcal{L}(f_t) + \dot\Delta_t \le 0 \]
by the above work, because i) $t \, \frac{d}{dt} \mathcal{L}(f_t) \le 0$ and ii) $\big[ \mathcal{L}(f_t) - \mathcal{L}(f^*) \big] + \dot\Delta_t \le 0$, implying that $E(t) \le E(0) = \Delta_0$ for all $t$. Evaluating, we thus have
\[ t \big[ \mathcal{L}(f_t) - \mathcal{L}(f^*) \big] + \Delta_t \le \Delta_0 \implies t \big[ \mathcal{L}(f_t) - \mathcal{L}(f^*) \big] \le \Delta_0 \quad (\text{since } \Delta_t \ge 0) \implies \mathcal{L}(f_t) - \mathcal{L}(f^*) \le \frac{\Delta_0}{t}, \]
and so (choosing any $T \ge \Delta_0 / \epsilon$) there exists a sufficiently large $T$ such that $|\mathcal{L}(f_T) - \mathcal{L}(f^*)| \le \epsilon$, as desired. $\square$
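To make Lemma 2 concrete, here is a small simulation of the fixed-kernel flow (21) on a discretized input space. It is a sketch under assumptions of our choosing, not the paper's setup: we take $q = 1$, a Gaussian RBF as a stand-in for $K_\infty$, and a Gaussian location family, so that $\ell(x, \eta) = (m(x) - \eta)^2 / 2$, $\ell'(x, \eta) = \eta - m(x)$, and the unique minimizer is $f^* = m$.

```python
import torch

n, dt = 200, 0.01
xs = torch.linspace(-3, 3, n)
m = torch.sin(xs)                                        # stand-in for f*
K = torch.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)   # positive-definite kernel
f = torch.zeros(n)                                       # f_0 = 0, as at initialization
for step in range(1, 5001):
    f = f - dt * K @ (f - m) / n                         # Euler step of the g_t flow
    if step % 1000 == 0:
        gap = 0.5 * torch.mean((f - m) ** 2)             # ~ L(g_t) - L(f*) on the grid
        print(f"t = {step * dt:5.1f}   suboptimality = {gap:.6f}")
```

The printed suboptimality decreases monotonically, consistent with the $\Delta_0 / t$ rate established by the Lyapunov argument above.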
Using this result and our previous results, we are now able to prove Theorem 1 from the manuscript.

Proof. We examine the three gradient flows
\[ \dot f_t(x) = -\mathbb{E}_{P(X)} \, K^p_{\phi(t)}(x, X) \, \ell'(X, f_t(X)) \tag{24} \]
\[ \dot g_t(x) = -\mathbb{E}_{P(X)} \, K_\infty(x, X) \, \ell'(X, g_t(X)) \tag{25} \]
\[ \dot h_t(x) = -\mathbb{E}_{P(X)} \, K^p_{\phi(0)}(x, X) \, \ell'(X, h_t(X)) \tag{26} \]
and establish the result by the triangle inequality, i.e.,
\[ |\mathcal{L}(f_T) - \mathcal{L}(f^*)| \le |\mathcal{L}(f_T) - \mathcal{L}(g_T)| + |\mathcal{L}(g_T) - \mathcal{L}(f^*)|, \tag{27} \]
where $\mathcal{L}$ denotes the functional objective $\mathcal{L}_F$. Let $\epsilon > 0$. The flow of $h_t$ will be used to help bound the first term, but we begin with the second term. By Lemma 2, pick $T > 0$ sufficiently large such that $|\mathcal{L}(g_T) - \mathcal{L}(f^*)| \le \epsilon/2$, and fix this $T$. This provides a suitable bound on the second term.

Turning to the first term, by continuity of $\mathcal{L}(f)$ in $f$, there exists $\delta > 0$ such that $y \in B(g_T, \delta)$ implies $|\mathcal{L}(y) - \mathcal{L}(g_T)| \le \epsilon/2$. We will show that there exists $P$ sufficiently large such that $p > P$ implies $\|f_T - g_T\| \le \delta$ almost surely, yielding the desired bound on the first term of the decomposition above. Throughout, $\|\cdot\|$ denotes the $L^2$ norm of a function with respect to the probability measure $P$ (i.e., the marginal distribution $P(X)$ from our joint model $P(\Theta, X)$).

To show that there exists sufficiently large $P$ such that $\|f_T - g_T\| \le \delta$, we use another application of the triangle inequality,
\[ \|f_T - g_T\| \le \|f_T - h_T\| + \|h_T - g_T\|, \]
and construct bounds on the two terms on the right-hand side using Proposition 2 and Proposition 3. Observe first that by (C2)/(D2), at initialization we have almost surely that $f_0 = g_0 = h_0 = 0$. Also note that by continuity of $K_\infty$ (established in Lemma 3) on the compact domain $\mathcal{X} \times \mathcal{X}$, we have $\sup_{x, \tilde{x}} \|K_\infty(x, \tilde{x})\|_2 < M$ for some $M$. Finally, note that by (E5) the function $\ell'(x, \eta)$ is Lipschitz in its second argument with constant $L$. Below, $\|\cdot\|_2$ denotes the 2-norm of a vector or matrix, depending on the argument, and $\|\cdot\|_F$ the Frobenius norm of a matrix. For functions, as stated, $\|\cdot\|$ denotes the $L^2$ norm with respect to the measure $P(X)$, i.e., $\|f\|^2 = \int f(X)^\top f(X) \, dP(X)$. From here, we have
\[
\begin{aligned}
\|g_T - h_T\| &\overset{\text{a.s.}}{=} \|(g_T - g_0) - (h_T - h_0)\| \\
&= \bigg\| \int_0^T \mathbb{E}_{P(X)} \Big[ K_\infty(\cdot, X) \, \ell'(X, g_t(X)) - K^p_{\phi(0)}(\cdot, X) \, \ell'(X, h_t(X)) \Big] \, dt \bigg\| \\
&\le \int_0^T \mathbb{E}_{P(X)} \big\| K_\infty(\cdot, X) \, \ell'(X, g_t(X)) - K_\infty(\cdot, X) \, \ell'(X, h_t(X)) + K_\infty(\cdot, X) \, \ell'(X, h_t(X)) - K^p_{\phi(0)}(\cdot, X) \, \ell'(X, h_t(X)) \big\| \, dt \\
&\le \int_0^T \mathbb{E}_{P(X)} \big\| K_\infty(\cdot, X) \big[ \ell'(X, g_t(X)) - \ell'(X, h_t(X)) \big] \big\| + \big\| K_\infty(\cdot, X) \, \ell'(X, h_t(X)) - K^p_{\phi(0)}(\cdot, X) \, \ell'(X, h_t(X)) \big\| \, dt. \tag{28}
\end{aligned}
\]
Now, we note the following facts. Firstly, for any kernel $K$ that is uniformly bounded (i.e., $\|K(x, y)\|_2 \le M$ for all $x, y$), the $L^2$ norm of the function $K(\cdot, X) v$ for fixed $X, v$ can be bounded via
\[ \|K(\cdot, X) v\|^2 = \int v^\top K(Y, X)^\top K(Y, X) \, v \, dP(Y) \le \int \|K(Y, X)\|_2^2 \, \|v\|_2^2 \, dP(Y) \le M^2 \|v\|_2^2, \]
so that $\|K(\cdot, X) v\| \le M \|v\|_2$. Secondly, again for any fixed $v$ and $X$,
\[ \Big\| \big[ K_\infty(\cdot, X) - K^p_{\phi(0)}(\cdot, X) \big] v \Big\|^2 = \int v^\top \big[ K_\infty(Y, X) - K^p_{\phi(0)}(Y, X) \big]^\top \big[ K_\infty(Y, X) - K^p_{\phi(0)}(Y, X) \big] v \, dP(Y) \le \sup_{x, y} \big\| K_\infty(y, x) - K^p_{\phi(0)}(y, x) \big\|_F^2 \, \|v\|_2^2, \]
so that $\| [K_\infty(\cdot, X) - K^p_{\phi(0)}(\cdot, X)] v \| \le \sup_{x, y} \|K_\infty(y, x) - K^p_{\phi(0)}(y, x)\|_F \, \|v\|_2$, since the matrix (spectral) 2-norm is dominated by the Frobenius norm.

Plugging these facts into Equation (28) above, we have
\[
\begin{aligned}
\|g_T - h_T\| &\le \int_0^T \mathbb{E}_{P(X)} \Big[ M \, \|\ell'(X, g_t(X)) - \ell'(X, h_t(X))\|_2 + \sup_{x, y} \|K_\infty(x, y) - K^p_{\phi(0)}(x, y)\|_F \, \|\ell'(X, h_t(X))\|_2 \Big] \, dt \\
&\le \int_0^T \mathbb{E}_{P(X)} \Big[ M L \, \|g_t(X) - h_t(X)\|_2 + M \, \sup_{x, y} \|K_\infty(x, y) - K^p_{\phi(0)}(x, y)\|_F \Big] \, dt && \text{by (E4), (E5)} \\
&\le M T \, \sup_{x, y} \|K_\infty(x, y) - K^p_{\phi(0)}(x, y)\|_F + M L \int_0^T \sqrt{\mathbb{E}_{P(X)} \|g_t(X) - h_t(X)\|_2^2} \, dt && \text{by Jensen's inequality} \\
&= M T \, \sup_{x, y} \|K_\infty(x, y) - K^p_{\phi(0)}(x, y)\|_F + M L \int_0^T \|g_t - h_t\| \, dt,
\end{aligned}
\]
whence
\[ \|g_T - h_T\| \le M T \, \sup_{x, y} \|K_\infty(x, y) - K^p_{\phi(0)}(x, y)\|_F \, \exp(M L T) \]
by Gronwall's inequality (Theorem 3). By Proposition 2, there thus exists $P_1$ such that for all $p > P_1$ we have $\|g_T - h_T\| \le \delta/2$ almost surely.

We proceed nearly identically for the term $\|h_T - f_T\|$. We need only note that for sufficiently large $p$, say $p > P_2$, we can bound $K^p_{\phi(0)}$ uniformly (almost surely) by a constant $A > M$. To see this, observe that by Proposition 2 there almost surely exists a sufficiently large $P_2$ such that $\sup_{x, y} \|K_\infty(x, y) - K^p_{\phi(0)}(x, y)\|_F < A - M$, and so by the triangle inequality,
\[ \sup_{x, y} \|K^p_{\phi(0)}(x, y)\|_F \le \sup_{x, y} \|K^p_{\phi(0)}(x, y) - K_\infty(x, y)\|_F + \sup_{x, y} \|K_\infty(x, y)\|_F \le (A - M) + M = A. \]
Thereafter,
\[
\begin{aligned}
\|h_T - f_T\| &\overset{\text{a.s.}}{=} \|(h_T - h_0) - (f_T - f_0)\| \\
&\le \int_0^T \mathbb{E}_{P(X)} \big\| K^p_{\phi(0)}(\cdot, X) \, \ell'(X, h_t(X)) - K^p_{\phi(t)}(\cdot, X) \, \ell'(X, f_t(X)) \big\| \, dt \\
&\le \int_0^T \mathbb{E}_{P(X)} \big\| K^p_{\phi(0)}(\cdot, X) \big[ \ell'(X, h_t(X)) - \ell'(X, f_t(X)) \big] \big\| + \big\| \big[ K^p_{\phi(0)}(\cdot, X) - K^p_{\phi(t)}(\cdot, X) \big] \ell'(X, f_t(X)) \big\| \, dt \\
&\le \int_0^T \mathbb{E}_{P(X)} \Big[ A L \, \|h_t(X) - f_t(X)\|_2 + M \sup_{x, y, \, t \in (0, T]} \|K^p_{\phi(0)}(x, y) - K^p_{\phi(t)}(x, y)\|_F \Big] \, dt \\
&\le M T \sup_{x, y, \, t \in (0, T]} \|K^p_{\phi(0)}(x, y) - K^p_{\phi(t)}(x, y)\|_F + A L \int_0^T \|h_t - f_t\| \, dt,
\end{aligned}
\]
where we switch from $\mathbb{E}_{P(X)} \|h_t(X) - f_t(X)\|_2$ to the $L^2$ norm $\|h_t - f_t\|$ using Jensen's inequality as above, and hence
\[ \|h_T - f_T\| \le M T \sup_{x, y, \, t \in (0, T]} \|K^p_{\phi(0)}(x, y) - K^p_{\phi(t)}(x, y)\|_F \, \exp(A L T) \]
by Gronwall's inequality. By the same logic as above and by Proposition 3, there exists $P_3$ such that $p > P_3$ implies $M T \sup_{x, y, \, t \in (0, T]} \|K^p_{\phi(0)}(x, y) - K^p_{\phi(t)}(x, y)\|_F \, \exp(A L T) \le \delta/2$. Then for all $p > \max(P_1, P_2, P_3)$, we have almost surely that $\|h_T - f_T\| \le \delta/2$. This completes the proof: by the triangle inequality, $\|f_T - g_T\| \le \delta$, and so $|\mathcal{L}(f_T) - \mathcal{L}(g_T)| \le \epsilon/2$ by construction. $\square$

F Experimental Details

Our code is publicly available at https://github.com/declanmcnamara/gcvi_neurips. We used PyTorch (Paszke et al., 2019) for our experiments in accordance with its license, and NVIDIA GeForce RTX 2080 Ti GPUs.

F.1 Toy Example

Recall the generative model for this problem, given by the following:
\[ \Theta \sim \text{Unif}[0, 2\pi], \quad Z \sim N(0, \sigma^2), \quad X \mid (\Theta = \theta, Z = z) \sim \delta\big( [\cos(\theta + z), \sin(\theta + z)]^\top \big). \]
The variable $\sigma$ is a hyperparameter of the model that we take to be $\sigma = 0.5$. The model is constructed in such a way that $x \in S^1$, so that assumptions (C1) and (D1), respectively, are satisfied. One thousand pairs of data points $\{\theta_i, x_i\}_{i=1}^{1000}$ were generated independently from the model above and fixed as the dataset, for which the ground-truth latent parameter values are known. We constructed scaled, dense single-hidden-layer ReLU networks of varying widths, with $2^j$ neurons for $j = 6, \dots, 12$, with the same architecture as in Appendix C and the initialization described in condition (C2).
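For concreteness, the data-generating process can be sketched as follows (illustrative code, with the noise distribution $Z \sim N(0, \sigma^2)$ reflecting our reading of the model above; this is not the released implementation):

```python
import math
import torch

sigma = 0.5  # noise scale hyperparameter, as in the text

def sample_joint(n):
    """Draw (theta, x) pairs by ancestral sampling from the toy model."""
    theta = 2 * math.pi * torch.rand(n)  # Theta ~ Unif[0, 2pi]
    z = sigma * torch.randn(n)           # Z ~ N(0, sigma^2)
    x = torch.stack([torch.cos(theta + z), torch.sin(theta + z)], dim=1)
    return theta, x                      # x lies on the unit circle S^1

theta, x = sample_joint(1000)            # the fixed dataset described above
```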
All networks were trained to minimize the expected forward KL objective; stochastic gradients were estimated using batches of 16 independent simulated $(\theta, x)$ pairs from the generative model, and SGD was performed using the Adam optimizer with a learning rate of $\rho = 0.0001$. We employ a learning rate scheduler that scales the learning rate as $O(1/I)$, where $I$ denotes the number of iterations. All models were fitted for 200,000 stochastic gradient steps, and the execution time is less than one hour. The natural parameter for the von Mises distribution is parameterized as $\eta = f(x; \phi) + 0.0001$. This small perturbation must be added because $f(\cdot\,; \phi) = 0$ at initialization and because the value $\eta = 0$ lies outside the natural parameter space for this variational family.

For the linearized neural network models, all training settings were the same except for the architecture. For these models, we first constructed neural networks as above for each width and computed the Jacobian evaluated at the initial weights, $J_\phi f(x; \phi_0)$. Thereafter, the model in $\phi$ is fixed as
\[ f(x; \phi) = f(x; \phi_0) + J_\phi f(x; \phi_0)(\phi - \phi_0), \]
where $\phi, \phi_0$ are flattened vectors of parameters from the neural network architectures. Using this linearized model, the parameter $\phi$ is fitted by SGD as above.
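A minimal sketch of this linearization is given below using torch.func (PyTorch >= 2.0); the network, its sizes, and the helper names are hypothetical stand-ins rather than the released implementation.

```python
import torch
from torch.func import functional_call, jacrev

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
params0 = {k: v.detach().clone() for k, v in net.named_parameters()}  # phi_0

def f_of_params(params, x):
    # Network output viewed as a function of the (dict of) parameters.
    return functional_call(net, params, (x,))

def linearized(params, x):
    # f(x; phi) ~= f(x; phi_0) + J_phi f(x; phi_0) (phi - phi_0)
    f0 = f_of_params(params0, x)
    jac = jacrev(f_of_params)(params0, x)  # per-parameter Jacobians at phi_0
    delta = {k: params[k] - params0[k] for k in params0}
    corr = sum((jac[k] * delta[k]).sum(dim=tuple(range(f0.dim(), jac[k].dim())))
               for k in params0)
    return f0 + corr

x = torch.randn(5, 2)
params = {k: v + 0.01 * torch.randn_like(v) for k, v in params0.items()}
print(linearized(params, x).shape)  # same output shape as net(x)
```

In such a model the output is affine in the parameters, so the kernel that drives its training dynamics is exactly the fixed kernel at initialization.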
The plots in Figure 1 of the manuscript are constructed by evaluating the average negative log-likelihood on the dataset at each iteration; i.e., for the fixed $n = 1000$ pairs of observations above, we evaluate the finite-sample loss for the expected forward KL divergence. Up to fixed constants, this quantity is given by
\[ \frac{1}{n} \sum_{i=1}^n -\log q(\theta_i; f(x_i; \phi)), \]
where $\phi$ is the current iterate of the parameters (either the neural network parameters or the flattened vector of parameters of the same size for the linearized model). The horizontal red line in Figure 1 is set at the value $\frac{1}{n} \sum_{i=1}^n -\log p(\theta_i \mid x_i)$, where $p$ denotes the exact posterior distribution (computed using numerical integration over a fine grid of evaluation points).

F.2 Label Switching in Amortized Clustering

Recall the amortized clustering model from the manuscript, given by
\[ S \sim N(0, 100^2), \quad Z \mid (S = s) \sim N\big( [\mu_1 + s, \dots, \mu_d + s]^\top, \, \sigma^2 I_d \big), \quad X_i \mid (Z = z) \sim \sum_{j=1}^d p_j \, N(z_j, \tau^2). \]
We take $\sigma = 1$, $\tau = 0.1$, and artificially fix $s = 100$ as our realization from the prior on $S$. Thereafter, we generate $N = 1000$ independent, identically distributed samples $\{x_i\}_{i=1}^{1000}$ from the model above, conditional on observing $S = 100$. The other non-random hyperparameters are the cluster centers $\mu = [-20, -10, 0, 10, 20]^\top$ and the mixture weights for the mixture of Gaussians, which are uniform, i.e., $p_j = \frac{1}{d}$ for all $j$. These draws constitute the data for this problem, for which inference on $S$ is desired conditionally on the draws $x_1, \dots, x_{1000}$; separately, we seek to infer the random variable $Z$ conditional on the data as well.
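The simulator for this model can be sketched as follows (illustrative; the isotropic covariance $\sigma^2 I_d$ for $Z \mid S$ and the signs of the cluster centers follow our reading of the display above):

```python
import torch

d, N = 5, 1000
mu = torch.tensor([-20., -10., 0., 10., 20.])  # cluster centers
sigma, tau = 1.0, 0.1

def sample_conditional(s, N):
    """Draw Z | S = s, then N mixture observations X_i | Z = z."""
    z = mu + s + sigma * torch.randn(d)         # Z | S = s
    comp = torch.randint(0, d, (N,))            # uniform mixture weights p_j = 1/d
    return z, z[comp] + tau * torch.randn(N)    # X_i | Z = z

z_true, xs = sample_conditional(s=100.0, N=N)   # data with the fixed draw S = 100
```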
Although $Z$ and $S$ are closely related, for ease we impose a mean-field variational family
\[ q(S, Z \mid X_1 = x_1, \dots, X_N = x_N) = q(S \mid X_1 = x_1, \dots, X_N = x_N) \, q(Z \mid X_1 = x_1, \dots, X_N = x_N), \]
with both components taken to be isotropic Gaussian distributions. We use two different parameterizations for each of $q(S)$ and $q(Z)$. The mean parameterization fixes the variance at unity and aims to learn only the mean; in this case, the natural parameter of this family of distributions is simply the mean, so the network parameterizes the variational means $\mu_S, \mu_Z$ for each of these distributions. We also use a natural parameterization with an unknown mean and variance; in this case, the natural parameters $\eta_S, \eta_Z$ are output by the networks, which we denote $f(\cdot\,; \phi)$ and $f(\cdot\,; \psi)$ for $S$ and $Z$, respectively.

Both amortized encoder networks $f(\cdot\,; \phi), f(\cdot\,; \psi)$ take the entire set of points $\{x_1, \dots, x_{1000}\}$ as input. The architecture should thus be permutation-invariant, as the order of these points does not matter for inference on the latent quantities of interest. We accomplish this by using a set transformer architecture for the networks, which achieves permutation invariance using self-attention blocks (Lee et al., 2019). Parameters are fitted to minimize $\mathcal{L}_P$ using SGD, with stochastic gradients estimated using simulated draws of the same size as the observed data. We use a learning rate of 0.001 for fitting both $q(Z)$ and $q(S)$, with schedules that decrease these as training progresses across 50,000 total gradient steps. We perform one hundred replicates of this experiment across different random seeds, each time generating a new dataset and refitting the networks. Accordingly, there is a high computational cost: using 10 parallel processes, the experiment takes about 8 hours to run. We compare $\mathcal{L}_P$-based minimization to fitting both networks to maximize an evidence lower bound (ELBO), using the IWBO with $K = 10$ importance samples as the objective but otherwise keeping all other hyperparameters the same.

Figure 2 in the manuscript plots the mode of $q(S; f(x_1, \dots, x_N; \phi))$ across the one hundred replicates of the experiment with different random seeds. The estimate should be approximately 100 to match the ground truth, and both ELBO-based and $\mathcal{L}_P$-based training perform well. For inference on $Z$, we extract the mode $\hat{Z} \in \mathbb{R}^5$ as the point estimate and compute the $\ell_1$ distance to the true draw $Z = z$ for our dataset (which is known a priori because the data were generated synthetically). ELBO-based training exhibits label-switching behavior, converging to a vector that is a permutation of the entries of the true draw $z$ and thereby incurring a large $\ell_1$ distance. $\mathcal{L}_P$-based fitting does not suffer from this pathological behavior and instead converges rapidly to the global optimum.

F.3 Rotated MNIST

We use the MNIST dataset, freely available from torchvision under the BSD-3 license (https://github.com/UiPath/torchvision/blob/master/LICENSE), to fit a GAN generative model of MNIST digits. The simplistic GAN uses dense networks for both the discriminator and the generator and is fit to the data with the binary cross-entropy loss. Training was stopped when the generator produced realistic images, sufficient for our problem (see Figure 3). The variational distribution $q(\Theta; f(x_1, \dots, x_N; \phi))$ on the shared rotation angle is taken to be von Mises for minimization of $\mathcal{L}_P$. We parameterize $q$ using its natural parameterization, as in Appendix F.1. The encoder for $\Theta$ uses three Conv2D layers to increase the number of channels to 64 (with a kernel size of 3 and a stride length of 2), followed by a flattening layer and a three-layer dense network with LeakyReLU activations. Permutation invariance is imposed by a naive averaging step across all observations in the dataset $\{x_i\}_{i=1}^N$ that are provided to the network. The neural network architecture for this example thus differs significantly from the simple two-layer ReLU network, yet it still demonstrates the same global convergence behavior.
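An average-pooled encoder of this kind can be sketched as below; the layer sizes, the exact placement of the averaging step relative to the dense head, and the two-dimensional output (for the von Mises natural parameter) are our assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AveragedEncoder(nn.Module):
    """Encode a set of images; averaging over the set yields permutation invariance."""
    def __init__(self, out_dim=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2), nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.LeakyReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.LeakyReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(128), nn.LeakyReLU(),
            nn.Linear(128, 128), nn.LeakyReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, xs):                 # xs: (N, 1, 28, 28), one set of N images
        h = self.conv(xs)                  # per-image features
        return self.head(h.mean(dim=0))    # average over the set, then dense head

eta = AveragedEncoder()(torch.randn(50, 1, 28, 28))  # natural parameter for q(Theta)
```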
We compare to a semi-amortized approach that maximizes the IWBO for this example. To compute the IWBO, the practitioner must posit a variational distribution on $z_i$ for each image $x_i$, because only the full joint likelihood $p(\theta, \{z_i\}_{i=1}^N, \{x_i\}_{i=1}^N)$ is readily available. Accordingly, for this method we construct a second network $f(\cdot\,; \psi)$ used to construct a variational distribution $q(Z_i; f(X_i; \psi))$ on $Z_i$ for a given $X_i$. This network is amortized over all images, yielding $q(Z_i; f(x_i; \psi))$ for any image $x_i$ in the dataset $\{x_i\}_{i=1}^N$. We parameterize these distributions on the latent representation as multivariate (isotropic) Gaussian, with the mean and log-scale parameters for each dimension given by the outputs of an encoder network. The architecture for this encoder consists of three Conv2D layers, each with a kernel size of 3 and a stride of length 2; across the three layers, we increase the number of channels from a single channel (the input) up to 8 channels. After the convolutional layers, we apply a flattening layer followed by a 2-layer dense network with ReLU activations. As the latent dimension for the data is known to be 64, the outputs of the encoder are 128-dimensional, parameterizing the mean and log-scale for each dimension.

As there is only one rotation angle of interest, we directly maximize the joint likelihood $p(\theta, \{z_i\}_{i=1}^N, \{x_i\}_{i=1}^N)$ in $\theta$ by targeting the IWBO. Conceptually, this approach is the same as placing a point-mass variational family on $\Theta$: we simply initialize $\theta_0$ prior to training and update its value directly to maximize the IWBO. This approach thus fits the parameters $\{\psi, \theta_0\}$ of $q(Z_i; f(x_i; \psi))$ and $\delta_{\theta_0}$, the distributions on the latent representations and the angle, respectively. These parameters are optimized using the Adam optimizer with an initial learning rate of 0.01 and a square-summable learning rate schedule. The angle parameter is initialized at a variety of angles in different trials to produce Figure 4.

The advantageous marginalization properties of $\mathcal{L}_P$-based fitting allow it to perform inference on $\theta$ without performing inference on the latent representations $Z_i$; thus, the distributions $q(Z_i; f(x_i; \psi))$ need not be fitted when using this approach. We use the Adam optimizer for all parameter updates with an initial learning rate of 0.0001 (and a square-summable learning rate schedule) for $\mathcal{L}_P$-based training, which fits only the parameters $\phi$ of $q(\Theta; f(x_1, \dots, x_N; \phi))$. Minimization of $\mathcal{L}_P$ trains for 100,000 gradient steps, although convergence is rapid, and so trajectories are truncated to produce Figure 4. Runs were performed in parallel across multiple processes for both methods and completed in less than one hour.

F.4 Local Optima vs. Global Optima

For this experiment, the setting above is modified slightly to remove the nuisance latent variables $\{Z_i\}$ and put ELBO-based optimization on more equal footing with expected forward KL minimization. The data are generated as follows: for $i = 1, \dots, 50$, we fix latents $z_i \overset{\text{iid}}{\sim} p(z)$, and $\theta_{\text{true}} = 260$ degrees is set artificially. The digits $x_i^0$ are computed via the generative adversarial network and thereafter fixed for the rest of the problem (the superscript denotes a rotation angle of zero degrees). With the unrotated digits fixed, the only random variables in the model are $\Theta$ and the rotated versions of the digits. Conditioning on the observed data $\{x_i\}_{i=1}^{50}$, $\Theta$ is thus the only latent variable to be inferred. In this setting, ELBO-based optimization can compute the full likelihood function $p(\theta, \{x_i\}_{i=1}^{50})$ for any value of $\Theta = \theta$ (the same Uniform$(0, 2\pi)$ prior is placed on the angle).

The simulation-based approach of expected forward KL minimization proceeds as usual: we draw $\Theta \sim$ Uniform$(0, 2\pi)$, and simulated digits are obtained by rotating the digits according to the drawn angle and adding noise. The variational distribution, as stated in the text, is Gaussian with fixed variance instead of the von Mises distribution used for the preceding rotated MNIST experiments. This is necessary for a one-to-one comparison between the two methods, as the von Mises distribution is not reparameterizable for ELBO-based training. The mean of the Gaussian is parameterized via the same neural network architecture above for fitting by both the ELBO and the expected forward KL. We use the Adam optimizer with a learning rate of 0.001 for 10,000 gradient steps.

For computing some of the metrics in Figure 7, we rely on approximations, as these quantities are difficult to compute exactly. Letting $x_{\text{true}}$ denote the observed data $\{x_i\}_{i=1}^{50}$, the forward KL divergence $\mathrm{KL}(p(\theta \mid x_{\text{true}}) \,\|\, q(\theta \mid x_{\text{true}}))$ is estimated using self-normalized importance sampling with $K = 1{,}000$ samples drawn from the prior as a proposal. The reverse KL divergence $\mathrm{KL}(q(\theta \mid x_{\text{true}}) \,\|\, p(\theta \mid x_{\text{true}}))$ is computed as the difference $\log p(x) - \mathrm{ELBO}(q)$; the log-evidence is approximated via the importance weighted bound (IWBO) with $K = 1{,}000$.
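The self-normalized importance sampling estimate of the forward KL can be sketched as follows; the interface (log_joint, log_prior, log_q, sample_prior) is a hypothetical stand-in for the quantities named above.

```python
import math
import torch

def forward_kl_snis(log_joint, log_prior, log_q, sample_prior, K=1000):
    """SNIS estimate of KL(p(theta | x_true) || q(theta | x_true)), prior proposal.

    Assumed interface: log_joint(theta) = log p(theta, x_true),
    log_prior(theta) = log p(theta), log_q(theta) = log q(theta | x_true),
    and sample_prior(K) draws K proposal samples from the prior.
    """
    theta = sample_prior(K)
    logw = log_joint(theta) - log_prior(theta)           # log p(x_true | theta_k)
    log_evidence = torch.logsumexp(logw, dim=0) - math.log(K)
    wbar = torch.softmax(logw, dim=0)                    # self-normalized weights
    log_post = log_joint(theta) - log_evidence           # approx. log p(theta_k | x_true)
    return (wbar * (log_post - log_q(theta))).sum()
```

Because $\log p(x)$ is itself replaced by an importance-sampling estimate, the returned value is a biased but consistent estimate of the forward KL as $K$ grows.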
NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The abstract and introduction outline the main contributions of our work relative to preceding work, making clear that our results are primarily theoretical while still having implications for practitioners. We make clear that for our results to hold, regularity conditions must be satisfied.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: In Section 6, we articulate the main conditions necessary for our results to hold. These are enumerated even more explicitly in Appendices C, D, and E. We acknowledge in Section 6 that in practice these assumptions may not be met, and in our experiments we anticipate several settings where the assumptions are not met exactly but our results still approximately hold.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: We include several lemmas and/or theorems in the main text. Due to space constraints, we do not enumerate all assumptions in the statements, but we do reference the assumptions numerically and list them explicitly in the relevant appendices. Our appendices contain detailed proofs of all lemmas and theorems provided. As the proof of Theorem 1 is rather involved, we devote a large portion of Section 4 to sketching the proof and providing intuition for the reader.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We provide detailed information in Appendix F to allow for reproduction of results. If accepted, we will publicly release code to reproduce our experiments exactly.
5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: In the appendix, we include a link to the GitHub repository with all code necessary to reproduce experiments.
6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We provide detailed information on all experiments in-text, with additional implementation details in the relevant appendices. We also plan to publicly release code for this paper on GitHub, where experiments can be replicated and various details (e.g., network architectures) can be examined to understand the results.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: Where reasonable, we show statistical significance by aggregating runs across different random seeds; we do this in Section 5.2. In our other experiments, which illustrate optimization trajectories, we omit error bars, as this type of data does not readily admit uncertainty estimates.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: In Appendix F, we detail the hardware used (GPUs), the memory requirements, and the approximate runtime for each of our experiments.
9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: We have read the Code of Ethics and verify that the paper complies.

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: Our work is primarily theoretical in nature, although we hope practitioners will utilize it to improve inference in probabilistic modeling. While the actions of individual practitioners have the potential for positive or negative societal impacts, our work is several stages removed from direct paths to applications yielding undesirable outcomes for society.
11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: We do not release any models with this paper that carry a risk of misuse.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We credit and cite PyTorch in our appendices, used with license to implement experiments. Our other main asset used is the MNIST dataset, whose license we name in Appendix F.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: No new assets are released with this work.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: There are no human participants or crowdsourcing used in this work.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: There are no human participants or crowdsourcing used in this work.