# Greedy inference with structure-exploiting lazy maps

Michael C. Brennan, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, mcbrenn@mit.edu
Daniele Bigoni, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, dabi@mit.edu
Olivier Zahm, Université Grenoble Alpes, INRIA, CNRS, LJK, 38000 Grenoble, France, olivier.zahm@inria.fr
Alessio Spantini, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, alessio.spantini@gmail.com
Youssef Marzouk, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, ymarz@mit.edu

(These authors contributed equally to this work.)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We propose a framework for solving high-dimensional Bayesian inference problems using structure-exploiting low-dimensional transport maps or flows. These maps are confined to a low-dimensional subspace (hence, lazy), and the subspace is identified by minimizing an upper bound on the Kullback–Leibler divergence (hence, structured). Our framework provides a principled way of identifying and exploiting low-dimensional structure in an inference problem. It focuses the expressiveness of a transport map along the directions of most significant discrepancy from the posterior, and can be used to build deep compositions of lazy maps, where low-dimensional projections of the parameters are iteratively transformed to match the posterior. We prove weak convergence of the generated sequence of distributions to the posterior, and we demonstrate the benefits of the framework on challenging inference problems in machine learning and differential equations, using inverse autoregressive flows and polynomial maps as examples of the underlying density estimators.

## 1 Introduction

Inference in the Bayesian setting typically requires the computation of integrals $\int f \,\mathrm{d}\pi$ over an intractable posterior distribution whose density $\pi$ is known up to a normalizing constant. (In this paper, we only consider distributions that are absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$, and thus use $\pi$ to denote both the distribution and its associated density.) One approach to this problem is to construct a deterministic nonlinear transformation, i.e., a transport map [57], that induces a coupling of $\pi$ with a tractable distribution $\rho$ (e.g., a standard Gaussian). Formally, we seek a map $T$ that pushes forward $\rho$ to $\pi$, written as $T_\sharp\rho = \pi$, such that the change of variables $\int f \,\mathrm{d}\pi = \int f \circ T \,\mathrm{d}\rho$ makes integration tractable.

Many constructions for such maps have been developed in recent years. Normalizing flows (see [34, 42, 46, 54] and references therein) build transport maps via a deep composition of functions parameterized by neural networks, with certain ansatzes to enable efficient computation. Many recently proposed autoregressive flows (for example, [17, 20, 27, 31, 43]) compose triangular maps, which allow for efficient evaluation of Jacobian determinants. In general, triangular maps [9, 33, 47] are sufficiently general to couple any absolutely continuous pair of distributions $(\rho, \pi)$, and their numerical approximations have been investigated in [29, 38, 40, 52]. The flow map of a neural ordinary differential equation [13, 21, 23] can also be seen as an infinite-layer limit of a normalizing flow. Alternatively, Stein variational methods [18, 35, 36] provide a nonparametric way of constructing $T$ as a composition of functions lying in a chosen RKHS.
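To make the pushforward relation concrete, here is a minimal one-dimensional sketch (our own illustration, not from the paper) of how a map $T$ that pushes a standard Gaussian $\rho$ forward to a target $\pi$ turns integration against $\pi$ into integration against $\rho$; it uses only NumPy and SciPy, and the particular map $T(z) = \mu + \sigma z$ is a toy choice:

```python
import numpy as np
from scipy import stats

# Toy 1-D example: T(z) = mu + sigma * z pushes rho = N(0, 1) forward to pi = N(mu, sigma^2).
mu, sigma = 1.5, 0.7
T = lambda z: mu + sigma * z
T_inv = lambda x: (x - mu) / sigma

rho = stats.norm(0.0, 1.0)
pi = stats.norm(mu, sigma)

# Pushforward density: (T_sharp rho)(x) = rho(T^{-1}(x)) * |d T^{-1} / dx|.
x = 0.3
pushforward = rho.pdf(T_inv(x)) * abs(1.0 / sigma)
assert np.isclose(pushforward, pi.pdf(x))

# Change of variables: int f dpi = int (f o T) drho, estimated by Monte Carlo over rho.
f = lambda x: np.sin(x) ** 2
z = rho.rvs(size=200_000, random_state=0)
mc_estimate = f(T(z)).mean()          # E_rho[ f(T(Z)) ]
print(mc_estimate, pi.expect(f))      # the two agree up to Monte Carlo error
```

The same identity is what makes a learned map useful in high dimensions: once $T$ is available, posterior expectations reduce to averages over the tractable base distribution.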
In general, it can be difficult to represent expressive maps in high dimensions. For example, triangular maps on $\mathbb{R}^d$ must describe $d$-variate functions and thus immediately encounter the curse of dimensionality. Similarly, kernel-based methods lose expressiveness in high dimensions [12, 18]. Flow-based methods often increase expressiveness by adding layers, but this is typically performed in an ad hoc or unstructured way, which also requires tuning.

Here we propose a framework for inference that creates target-informed architectures around any class of transport maps or normalizing flows. In particular, our framework uses rigorous a priori error bounds to discover and exploit low-dimensional structure in a given target distribution. It also provides a methodology for efficiently solving high-dimensional inference problems via greedily constructed compositions of structured low-dimensional maps.

The impact of our approach rests on two observations. First, the coordinate basis in which one expresses a transport map (i.e., $T(x)$ versus $U T(x)$, where $U$ is a rotation on $\mathbb{R}^d$) can strongly affect the training behavior and final performance of the method. Our framework identifies an ordered basis that best reveals a certain low-dimensional structure in the problem. Expressing the transport map in this basis focuses the expressiveness of the underlying transport class and allows for principled dimension reduction. This basis is identified by minimizing an upper bound on the Kullback–Leibler (KL) divergence between $\pi$ and its approximation, which follows from logarithmic Sobolev inequalities (see [59]) relating the KL divergence to gradients of the target density. Second, in the spirit of normalizing flows, we seek to increase the expressiveness of a transport map using repeated compositions. Rather than specifying the length of the flow before training, we increase the length of the flow sequentially. For each layer, we apply the framework above to a residual distribution that captures the deviation between the target distribution and its current approximation. We prove weak convergence of this greedy approach to the target distribution under reasonable assumptions. This sequential framework enables efficient layer-wise training of high-dimensional maps, which especially helps control the curse of dimensionality in certain transport classes. As we shall demonstrate empirically, the greedy composition approach can further improve accuracy at the end of training, compared to baseline methods.

Since Markov chain Monte Carlo (MCMC) methods are also a workhorse of inference, it is useful to contrast them with the variational methods discussed above. In general, these two classes of methods have different computational patterns. In variational inference, one might spend considerable effort to construct the approximate posterior, but afterwards enjoys cheap access to samples and normalized evaluations of the (approximate) target density. How well the approximation matches the true posterior depends on the expressiveness of the approximation class and on one's ability to optimize within this class. MCMC, in contrast, requires continual computational effort (even after tuning), but (asymptotically) generates samples from the exact posterior. Yet there is a line of work that uses transport to improve the performance of MCMC methods [26, 44], so that even if one desires exact samples, constructing a transport map can be beneficial. We will demonstrate this link in our numerical experiments.
Preliminaries. We will consider target distributions with densities $\pi$ on $\mathbb{R}^d$ that are differentiable almost everywhere and that can be evaluated up to a normalizing constant. Such a target will often be the posterior of a Bayesian inference problem, e.g., $\pi(x) := p(x\,|\,y) \propto \mathcal{L}_y(x)\,\pi_0(x)$, where $\mathcal{L}_y(x) := p(y\,|\,x)$ is the likelihood function and $\pi_0$ is the prior. We denote the standard Gaussian density on $\mathbb{R}^d$ by $\rho$. We will consider maps $T: \mathbb{R}^d \to \mathbb{R}^d$ that are diffeomorphisms (in general $T$ need not be a diffeomorphism, only a suitably invertible map; see Appendix B for more details. The distributions considered in this paper, however, fulfill the conditions needed for $T$ to be differentiable almost everywhere), and with some abuse of notation, we will write the pushforward density of $\rho$ under $T$ as $T_\sharp\rho(x) := \rho(T^{-1}(x))\,|\det \nabla T^{-1}(x)|$. We will frequently also use the notion of a pullback distribution or density, written as $T^\sharp\pi := (T^{-1})_\sharp\pi$.

In §2 we show how to build a single map in the low-dimensional lazy format described above, and describe the class of posterior distributions that admit such structure. In §3 we develop a greedy algorithm for building deep compositions of lazy maps, which effectively decomposes any inference problem into a series of lower-dimensional problems. §4 presents numerical experiments highlighting the benefits of the lazy framework. While our numerical experiments employ inverse autoregressive flows [31] and polynomial transport maps [29, 40] as the underlying transport classes, we emphasize that the lazy framework is applicable to any class of transport.

## 2 Lazy maps

Given a unitary matrix $U \in \mathbb{R}^{d \times d}$ and an integer $r \le d$, let $\mathcal{T}_r(U)$ be the set that contains all the maps $T: \mathbb{R}^d \to \mathbb{R}^d$ of the form

$$T(z) = U \begin{bmatrix} \tau(z_1, \ldots, z_r) \\ z_\perp \end{bmatrix} = U_r\,\tau(z_1, \ldots, z_r) + U_\perp z_\perp \qquad (1)$$

for some diffeomorphism $\tau: \mathbb{R}^r \to \mathbb{R}^r$. Here $U_r \in \mathbb{R}^{d \times r}$ and $U_\perp \in \mathbb{R}^{d \times (d-r)}$ are the matrices containing, respectively, the first $r$ and the last $d-r$ columns of $U$, and $z_\perp = (z_{r+1}, \ldots, z_d)$. Any map $T \in \mathcal{T}_r(U)$ is called a lazy map with rank bounded by $r$, as it is nonlinear only with respect to the first $r$ input variables $z_1, \ldots, z_r$, and the nonlinearity is contained in the low-dimensional subspace $\mathrm{range}(U_r)$. The next proposition gives a characterization of all the densities $T_\sharp\rho$ when $T \in \mathcal{T}_r(U)$.

Proposition 1 (Characterization of lazy maps). Let $U \in \mathbb{R}^{d \times d}$ be a unitary matrix and let $r \le d$. Then for any lazy map $T \in \mathcal{T}_r(U)$, there exists a strictly positive function $f: \mathbb{R}^r \to \mathbb{R}_{>0}$ such that

$$T_\sharp\rho(x) = f(U_r^\top x)\,\rho(x) \qquad (2)$$

for all $x \in \mathbb{R}^d$, where $\rho$ is the density of the standard normal distribution. Conversely, any probability density function of the form $f(U_r^\top x)\,\rho(x)$ admits a representation as in (2) for some $T \in \mathcal{T}_r(U)$.

The proof is given in Appendix A.1. By Proposition 1, any posterior density $\pi(x) \propto \mathcal{L}_y(x)\,\pi_0(x)$ with standard Gaussian prior $\pi_0 = \rho$ and with likelihood function of the form $\mathcal{L}_y(x) = f(U_r^\top x)$ can be written exactly as $\pi = T_\sharp\rho$ for some lazy map $T \in \mathcal{T}_r(U)$. In particular, posteriors of generalized linear models naturally fall into this class; see Appendix D for more details.

Following [59, Section 2.1], the solution $T^\star \in \mathcal{T}_r(U)$ to

$$D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) = \min_{T \in \mathcal{T}_r(U)} D_{\mathrm{KL}}(\pi \,\|\, T_\sharp\rho)$$

is such that $T^\star_\sharp\rho(x) = f^\star(U_r^\top x)\,\rho(x)$, where $f^\star$ is the conditional expectation

$$f^\star(x_r) = \mathbb{E}\!\left[\frac{\pi(X)}{\rho(X)} \;\Big|\; U_r^\top X = x_r\right], \qquad X \sim \rho.$$

Now that we know the optimal lazy map in $\mathcal{T}_r(U)$, it remains to find a suitable matrix $U$ and rank $r$. In Appendix A.2 we show that

$$D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) = D_{\mathrm{KL}}(\pi \,\|\, \rho) - D_{\mathrm{KL}}\big((U_r^\top)_\sharp\pi \,\|\, \rho_r\big), \qquad (3)$$

where $\rho_r$ is the density of the standard normal distribution on $\mathbb{R}^r$ and $(U_r^\top)_\sharp\pi$ is the density of $U_r^\top X$ with $X \sim \pi$.
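As a concrete illustration of the structure in (1), the following NumPy sketch (our own, not the authors' code) assembles a lazy map from an arbitrary low-dimensional diffeomorphism $\tau$ and evaluates it together with its log-Jacobian-determinant; the particular $\tau$ used at the end is a hypothetical placeholder:

```python
import numpy as np

def make_lazy_map(U, r, tau, tau_log_det_jac):
    """Build T(z) = U_r tau(z_1:r) + U_perp z_perp from an r-dimensional diffeomorphism tau."""
    Ur, Uperp = U[:, :r], U[:, r:]

    def T(z):
        zr, zperp = z[:r], z[r:]
        return Ur @ tau(zr) + Uperp @ zperp

    def log_det_jac(z):
        # U is orthogonal (|det U| = 1) and the map is block-diagonal in the rotated
        # coordinates, so the log-Jacobian-determinant of T equals that of tau alone.
        return tau_log_det_jac(z[:r])

    return T, log_det_jac

# Toy usage: d = 5, r = 2, with a simple invertible tau (hypothetical choice).
rng = np.random.default_rng(0)
d, r = 5, 2
U, _ = np.linalg.qr(rng.normal(size=(d, d)))      # random orthogonal basis
tau = lambda zr: np.array([zr[0] + zr[1] ** 2, 2.0 * zr[1]])
tau_ldj = lambda zr: np.log(2.0)                  # log |det d tau / d zr| = log 2
T, ldj = make_lazy_map(U, r, tau, tau_ldj)
z = rng.normal(size=d)
print(T(z), ldj(z))
```

With this structure, the pushforward density $T_\sharp\rho$ deviates from $\rho$ only through the $r$-dimensional projection $U_r^\top x$, exactly as Proposition 1 states.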
Thus, for fixed $r$, minimizing $D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho)$ over $U$ is the same as finding the most non-Gaussian marginal $(U_r^\top)_\sharp\pi$. Such an optimal $U$ can be difficult to find in practice. The next proposition instead gives a computable bound on $D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho)$, which we will use to construct a $U$ suitable for our algorithm. The proof is given in Appendix A.3.

Proposition 2. Let $(\lambda_i, u_i) \in \mathbb{R}_{\ge 0} \times \mathbb{R}^d$ be the $i$-th eigenpair of the eigenvalue problem $H u_i = \lambda_i u_i$, with eigenvalues ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, where

$$H = \int \Big(\nabla\log\frac{\pi}{\rho}\Big)\Big(\nabla\log\frac{\pi}{\rho}\Big)^{\!\top} \mathrm{d}\pi.$$

Let $U = [u_1, \ldots, u_d] \in \mathbb{R}^{d \times d}$ be the matrix containing the eigenvectors of $H$. Then for any $r \le d$ we have

$$D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) \le \tfrac{1}{2}\,(\lambda_{r+1} + \cdots + \lambda_d). \qquad (4)$$

Proposition 2 suggests constructing $U$ as the matrix of eigenvectors of $H$, and implies that a fast decay in the spectrum of $H$ allows a lazy map with low $r$ to accurately represent the true posterior. Indeed, one can guarantee $D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) < \varepsilon$ by choosing $r$ to be the smallest integer such that the bound in (4), $\tfrac{1}{2}(\lambda_{r+1} + \cdots + \lambda_d)$, is below $\varepsilon$. In practice, since the complexity of representing and training a transport map may strongly depend on $r$, we can bound $r$ by some $r_{\max} \le d$ associated with a computational budget for constructing $T$. This procedure is summarized in Algorithm 1.

The practical implementation of Algorithm 1 relies on the computation of $H$. Direct Monte Carlo estimation of $H$, however, requires generating samples from $\pi$, which is not feasible in practice. Instead one can use an importance sampling estimate,

$$\widehat{H} = \sum_{k=1}^{K} \omega_k \Big(\nabla\log\frac{\pi}{\rho}(X_k)\Big)\Big(\nabla\log\frac{\pi}{\rho}(X_k)\Big)^{\!\top},$$

where $\{X_k\}_{k=1}^K$ are i.i.d. samples from $\rho$ and $\omega_k = \frac{\pi(X_k)/\rho(X_k)}{\sum_{k'=1}^{K} \pi(X_{k'})/\rho(X_{k'})}$ are self-normalized weights. This estimate can have significant variance when $\rho$ is a poor approximation to the target $\pi$ (e.g., in the first stage of the greedy algorithm in §3). In this case it is preferable to impose $\omega_k = 1/K$, which reduces variance but yields a biased estimator of $H$; instead, it is an unbiased estimator of

$$H_B = \int \Big(\nabla\log\frac{\pi}{\rho}\Big)\Big(\nabla\log\frac{\pi}{\rho}\Big)^{\!\top} \mathrm{d}\rho.$$

As shown via the error bounds in [59, Sec. 3.3.2], this matrix still provides useful information regarding the construction of $U$. We consider the differences between the two estimators in Appendix E. Also, since the effective sample size (ESS) of the importance sampling estimate can be computed with little extra cost after collecting samples, one can use this ESS to choose whether to use $\widehat{H}$ or $\widehat{H}_B$. Other variance reduction methods may also be applicable; for example, simplifications or approximations of the expected outer product of score functions yield natural candidates for control variates.

In constructing a lazy map $T$ of the form (1), one needs to identify a map $\tau: \mathbb{R}^r \to \mathbb{R}^r$ such that $T_\sharp\rho$ approximates the posterior. One can use any transport class to parameterize $\tau$; Appendices B and C detail the particular maps used in our numerical experiments. In our setting we can only evaluate $\pi$ up to a normalizing constant, and thus it is expedient to minimize the reverse KL divergence $D_{\mathrm{KL}}(T_\sharp\rho \,\|\, \pi) = D_{\mathrm{KL}}(\rho \,\|\, T^\sharp\pi)$, as is typical in variational Bayesian methods. This can be achieved by maximizing a Monte Carlo or quadrature approximation of $\mathbb{E}_\rho[\log T^\sharp\pi]$, which is equivalent to maximizing the evidence lower bound (ELBO) and using the reparameterization trick [32] to write the expectation over the base distribution $\rho$. Details on the numerical implementation of Algorithm 1 are given in Appendix F. We note that the lazy framework works to control the KL divergence in the inclusive direction, while optimizing the ELBO minimizes the KL divergence in the exclusive direction.
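The estimators above are straightforward to implement. The sketch below is our own illustration (not the authors' implementation), assuming a function `grad_log_ratio(x)` returning $\nabla\log\frac{\pi}{\rho}(x)$ is available (e.g., via automatic differentiation; for standard normal $\rho$ this is $\nabla\log\pi(x) + x$). It forms the uniform-weight estimate of $H_B$ from standard-normal draws and picks the basis $U$ and rank $r$ as in Algorithm 1:

```python
import numpy as np

def estimate_HB(grad_log_ratio, d, n_samples=500, rng=None):
    """Monte Carlo estimate of H_B = E_rho[ s(X) s(X)^T ], with s(x) = grad log(pi/rho)(x)."""
    rng = rng or np.random.default_rng(0)
    X = rng.normal(size=(n_samples, d))            # i.i.d. samples from rho = N(0, I_d)
    S = np.stack([grad_log_ratio(x) for x in X])   # rows are s(X_k)
    return S.T @ S / n_samples                     # uniform weights omega_k = 1/K

def choose_basis_and_rank(H, eps, r_max):
    """Eigendecompose H and pick the smallest rank whose tail sum satisfies the bound (4)."""
    lam, V = np.linalg.eigh(H)                     # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]                 # reorder to descending
    tails = 0.5 * (lam.sum() - np.cumsum(lam))     # 0.5 * (lam_{r+1} + ... + lam_d) for r = 1..d
    r = next((i + 1 for i, t in enumerate(tails) if t <= eps), len(lam))
    return V, min(r, r_max)                        # U = eigenvector basis, r capped by budget
```

The self-normalized variant of the estimator simply replaces the uniform average by weights proportional to $\pi(X_k)/\rho(X_k)$; the ESS of those weights can then guide the choice between $\widehat{H}$ and $\widehat{H}_B$, as described above.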
We show empirically that this computational strategy provides performance improvements in both directions of the KL divergence between the true and approximate posterior, compared to a baseline that does not utilize the lazy framework.

Algorithm 1: Construction of a lazy map.
1. procedure LazyMap($\pi$, $\rho$, $\varepsilon$, $r_{\max}$)
2. Compute $H = \int (\nabla\log\frac{\pi}{\rho})(\nabla\log\frac{\pi}{\rho})^\top \mathrm{d}\pi$
3. Solve the eigenvalue problem $H u_i = \lambda_i u_i$
4. Let $r = \min\{\, r_{\max},\ \min\{ r \le d : \tfrac{1}{2}\sum_{i>r}\lambda_i \le \varepsilon \} \,\}$ and assemble $U = [u_1, \ldots, u_d]$
5. Find $T$ by solving $\min_{T \in \mathcal{T}_r(U)} D_{\mathrm{KL}}(T_\sharp\rho \,\|\, \pi)$
6. return the lazy map $T$
7. end procedure

## 3 Deeply lazy maps

The restriction $r \le r_{\max}$ in Algorithm 1 helps control the computational cost of constructing the lazy map, but unless a problem admits sufficient lazy structure, $T_\sharp\rho$ may not adequately approximate the posterior. To extend the numerical benefits of the lazy framework to general problems, we consider the deeply lazy map $\mathfrak{T}_\ell$, a composition of $\ell$ lazy maps,

$$\mathfrak{T}_\ell = T_1 \circ \cdots \circ T_\ell, \qquad T_k \in \mathcal{T}_r(U^k),$$

where each $T_k$ is a lazy map associated with a different unitary matrix $U^k \in \mathbb{R}^{d \times d}$. For simplicity we consider the case where each lazy layer $T_k$ has the same rank $r$, though it is trivial to allow the ranks to vary from layer to layer. In general, the composition of lazy maps is not itself a lazy map. For example, there exist $U^1 \ne U^2$ such that $\mathfrak{T}_2 = T_1 \circ T_2$ can depend nonlinearly on each input variable, so that $\mathfrak{T}_2$ cannot be written as in (1).

The diagnostic matrix $H$ allows us to build deeply lazy maps in a greedy way. After $\ell - 1$ iterations, the composition of maps $\mathfrak{T}_{\ell-1} = T_1 \circ \cdots \circ T_{\ell-1}$ has been constructed. We seek a unitary matrix $U^\ell \in \mathbb{R}^{d \times d}$ and a lazy map $T_\ell \in \mathcal{T}_r(U^\ell)$ such that $(\mathfrak{T}_{\ell-1} \circ T_\ell)_\sharp\rho$ best improves over $(\mathfrak{T}_{\ell-1})_\sharp\rho$ as an approximation to the posterior. To this end, we define the residual distribution $\pi_{\ell-1} = (\mathfrak{T}_{\ell-1})^\sharp\pi$, i.e., the pullback of $\pi$ through the current transport map $\mathfrak{T}_{\ell-1}$. Note that

$$D_{\mathrm{KL}}\big(\pi \,\|\, (\mathfrak{T}_{\ell-1} \circ T_\ell)_\sharp\rho\big) = D_{\mathrm{KL}}\big(\pi_{\ell-1} \,\|\, (T_\ell)_\sharp\rho\big).$$

We thus build $T_\ell$ using Algorithm 1, replacing the posterior $\pi$ by the residual distribution $\pi_{\ell-1}$. We then update the transport map to $\mathfrak{T}_\ell = \mathfrak{T}_{\ell-1} \circ T_\ell$ and the residual density to $\pi_\ell = (\mathfrak{T}_\ell)^\sharp\pi$. We note that applying Proposition 2 to $\pi_\ell$ with $r = 0$ yields

$$D_{\mathrm{KL}}\big(\pi \,\|\, (\mathfrak{T}_\ell)_\sharp\rho\big) = D_{\mathrm{KL}}(\pi_\ell \,\|\, \rho) \le \tfrac{1}{2}\,(\lambda_1 + \cdots + \lambda_d) = \tfrac{1}{2}\,\mathrm{Tr}(H_\ell),$$

where we define the diagnostic matrix at iteration $\ell$ as

$$H_\ell = \int \Big(\nabla\log\frac{\pi_\ell}{\rho}\Big)\Big(\nabla\log\frac{\pi_\ell}{\rho}\Big)^{\!\top} \mathrm{d}\pi_\ell.$$

Our framework thus naturally exposes the error bound $\tfrac{1}{2}\mathrm{Tr}(H_\ell)$ on the forward KL divergence, which is of independent interest and applicable to any flow-based method. We refer to this bound as the trace diagnostic. This bound can also be used as a stopping criterion for the greedy algorithm: one can continue adding layers until the bound falls below some desired threshold. This construction is summarized in Algorithm 2, and details on its numerical implementation are given in Appendix F.

Algorithm 2: Construction of a deeply lazy map.
1. procedure LayersOfLazyMaps($\pi$, $\rho$, $\varepsilon$, $r$, $\ell_{\max}$)
2. Set $\pi_0 = \pi$ and $\ell = 0$
3. while $\ell \le \ell_{\max}$ and $\tfrac{1}{2}\mathrm{Tr}(H_\ell) \ge \varepsilon$ do
4. &nbsp;&nbsp;&nbsp;$\ell \leftarrow \ell + 1$
5. &nbsp;&nbsp;&nbsp;Compute $T_\ell = \mathrm{LazyMap}(\pi_{\ell-1}, \rho, 0, r)$ (Algorithm 1)
6. &nbsp;&nbsp;&nbsp;Update $\mathfrak{T}_\ell = \mathfrak{T}_{\ell-1} \circ T_\ell$
7. &nbsp;&nbsp;&nbsp;Compute $\pi_\ell = (\mathfrak{T}_\ell)^\sharp\pi$
8. &nbsp;&nbsp;&nbsp;Compute $H_\ell = \int (\nabla\log\frac{\pi_\ell}{\rho})(\nabla\log\frac{\pi_\ell}{\rho})^\top \mathrm{d}\pi_\ell$
9. end while
10. return $\mathfrak{T}_\ell = T_1 \circ \cdots \circ T_\ell$
11. end procedure

The next proposition gives a sufficient condition on $U^\ell$ to guarantee the convergence of our greedy algorithm. The proof is given in Appendix A.4.

Proposition 3. Let $U^1, U^2, \ldots$ be a sequence of unitary matrices. For any $\ell \ge 1$, let $T_\ell \in \mathcal{T}_r(U^\ell)$ be a lazy map that minimizes $D_{\mathrm{KL}}(\pi_{\ell-1} \,\|\, (T_\ell)_\sharp\rho)$, where $\pi_{\ell-1} = (T_1 \circ \cdots \circ T_{\ell-1})^\sharp\pi$. If there exists $0 < t \le 1$ such that for any $\ell \ge 1$

$$D_{\mathrm{KL}}\big((U_r^{\ell\,\top})_\sharp\pi_{\ell-1} \,\|\, \rho_r\big) \ge t \sup_{U \in \mathbb{R}^{d \times d}:\ U U^\top = I_d} D_{\mathrm{KL}}\big((U_r^\top)_\sharp\pi_{\ell-1} \,\|\, \rho_r\big), \qquad (5)$$

then $(T_1 \circ \cdots \circ T_\ell)_\sharp\rho$ converges weakly to $\pi$.
Figure 1: Convergence of the algorithm for the approximation of the rotated banana distribution. (a) Decay of the bound $\tfrac{1}{2}\mathrm{Tr}(H_\ell^B)$ on the KL divergence $D_{\mathrm{KL}}(\pi \,\|\, (\mathfrak{T}_\ell)_\sharp\rho)$ and of the variance diagnostic $\tfrac{1}{2}\mathbb{V}_\rho[\log \rho/\mathfrak{T}_\ell^\sharp\pi]$, versus lazy iteration. (b) The target density $\pi$. (c–g) The target distribution is progressively Gaussianized by the maps $\mathfrak{T}_\ell$ (pullbacks $\mathfrak{T}_\ell^\sharp\pi$ for $\ell = 1, 2, 3, 5, 8$).

Let us comment on the condition (5). Recall that the unitary matrix $U$ that maximizes $D_{\mathrm{KL}}((U_r^\top)_\sharp\pi_{\ell-1} \,\|\, \rho_r)$ is optimal; see (3). By (5), the case $t = 1$ means that $U^\ell$ is optimal at each iteration; this corresponds to an ideal greedy algorithm. The case $0 < t < 1$ allows suboptimal choices of $U^\ell$ without losing the convergence property of the algorithm. Such a greedy algorithm converges even with a potentially crude selection of $U^\ell$ corresponding to a $t$ close to zero. This is also why an approximation of $H_\ell$ is expected to be sufficient; see Section 4. We emphasize that condition (5) must apply simultaneously to all layers for a given $0 < t \le 1$. Following [55], one could relax this condition by replacing $t$ with a sequence $(t_\ell)$ that goes to zero sufficiently slowly; this development is left for future work. Finally, note that Proposition 3 does not require any constraints on $r$, so we have convergence even with $r = 1$, where each layer acts on only a single direction at a time.

## 4 Numerical examples

We present numerical demonstrations of the lazy framework as follows. We first illustrate Algorithm 2 on a 2-dimensional toy example, where we show the progressive Gaussianization of the posterior using a sequence of 1-dimensional lazy maps. We then demonstrate the benefits of the lazy framework (Algorithms 1 and 2) on several challenging inference problems. We consider Bayesian logistic regression and a Bayesian neural network, and compare the performance of a baseline transport map to lazy maps using the same underlying transport class. We measure performance improvements in four ways: (1) the final ELBO achieved by the transport maps after training; (2 and 3) the final trace diagnostics $\tfrac{1}{2}\mathrm{Tr}(H_\ell^B)$ and $\tfrac{1}{2}\mathrm{Tr}(H_\ell)$, which bound the error $D_{\mathrm{KL}}(\pi \,\|\, (\mathfrak{T}_\ell)_\sharp\rho)$; and (4) the variance diagnostic $\tfrac{1}{2}\mathbb{V}_\rho[\log \rho/\mathfrak{T}_\ell^\sharp\pi]$, which is an asymptotic approximation of $D_{\mathrm{KL}}((\mathfrak{T}_\ell)_\sharp\rho \,\|\, \pi)$ as $(\mathfrak{T}_\ell)_\sharp\rho \to \pi$ (see [40]). Finally, we highlight the advantages of greedily training lazy maps in a nonlinear problem defined by a high-dimensional elliptic partial differential equation (PDE), often used for testing high-dimensional inference methods [4, 16, 53]. Here, the lazy framework is needed to make variational inference tractable by controlling the total number of map parameters. We also illustrate the utility of such flows in preconditioning Markov chain Monte Carlo (MCMC) samplers [26, 44], or equivalently as a way of de-biasing the variational approximation, on these three problems.

Numerical examples are implemented both in the TransportMaps framework [7] and using the TensorFlow Probability library [19]. The PDE considered in §4.4 is discretized and solved using the FEniCS [37] and dolfin-adjoint [22] packages. (Code for the numerical examples can be found at https://github.com/MichaelCBrennan/lazymaps and http://bit.ly/2QlelXF. Data for §4.4, G.4, and G.5 can be downloaded at http://bit.ly/2X09Ns8, http://bit.ly/2HytQc0 and http://bit.ly/2Eug5ZR.)
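Before turning to the experiments, here is a schematic Python sketch of the greedy construction in Algorithm 2. It is our own illustration of the control flow, not the authors' implementation: the helper routines `estimate_H`, `fit_lazy_layer`, and `pullback` are hypothetical placeholders standing in for the estimator of §2 and for whatever transport class is used within each layer.

```python
import numpy as np

def deeply_lazy(log_pi, grad_log_pi, r, eps, ell_max,
                estimate_H, fit_lazy_layer, pullback):
    """Schematic greedy construction of a deeply lazy map (cf. Algorithm 2).

    estimate_H(log_p, grad_log_p)          -> (d, d) diagnostic matrix for the current residual
    fit_lazy_layer(log_p, grad_log_p, U, r) -> a rank-r lazy layer in the basis U (ELBO training)
    pullback(log_p, grad_log_p, T)          -> log-density and gradient of T^sharp p
    """
    layers = []
    log_res, grad_res = log_pi, grad_log_pi          # residual pi_0 = pi
    H = estimate_H(log_res, grad_res)
    while len(layers) < ell_max and 0.5 * np.trace(H) > eps:   # trace-diagnostic stopping rule
        lam, U = np.linalg.eigh(H)
        U = U[:, ::-1]                               # eigenvectors, descending eigenvalues
        T_ell = fit_lazy_layer(log_res, grad_res, U, r)
        layers.append(T_ell)
        # pi_ell = T_ell^sharp pi_{ell-1}, i.e. the pullback of the previous residual.
        log_res, grad_res = pullback(log_res, grad_res, T_ell)
        H = estimate_H(log_res, grad_res)
    return layers                                    # the deep map is T_1 o ... o T_ell
```

The loop makes explicit why the method is "greedy": each layer sees only the residual left by its predecessors, and the trace diagnostic both orders the basis for the next layer and decides when to stop.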
## 4.1 Illustrative toy example

We first apply the algorithm to the standard problem of approximating the rotated banana distribution $Q_\sharp\pi_{X_1,X_2}$, defined by $X_1 \sim \mathcal{N}(0.5, 0.8)$ and $X_2 \,|\, X_1 \sim \mathcal{N}(X_1^2, 0.2)$, where $Q$ is a random rotation. We restrict ourselves to a composition of rank-1 lazy maps and use degree-3 polynomial maps as the underlying transport class. We use Gauss quadrature rules of order 10 for the discretization of the KL divergence and the approximation of $H_\ell^B$ ($m = 121$ in Algorithms 3 and 5). Figure 1b shows the target distribution $\pi := Q_\sharp\pi_{X_1,X_2}$. Figure 1a shows the convergence of the algorithm both in terms of the trace diagnostic $\tfrac{1}{2}\mathrm{Tr}(H_\ell^B)$ and in terms of the variance diagnostic. After two iterations the algorithm has explored all directions of $\mathbb{R}^2$, leading to a fast improvement. The convergence stagnates once the trade-off between the complexity of the underlying transport class and the accuracy of the quadrature has been saturated. Figures 1c–g show the progressive Gaussianization of the residual distributions $\mathfrak{T}_\ell^\sharp\pi$ for different iterations $\ell$.

## 4.2 Bayesian logistic regression

We now consider a high-dimensional Bayesian logistic regression problem using the UCI Parkinson's disease classification data [1], studied in [49]. We use the first 500 provided attributes, consisting mainly of features extracted from patient speech recordings. This results in a $d = 500$ dimensional inference problem. We choose a relatively uninformative prior of $\mathcal{N}(0, 10^2 I_d)$. Here we consider inverse autoregressive flows (IAFs) [31] as the underlying transport class. Details on the IAF structure, our choice of hyperparameters, and the training procedure are in Appendix C.

As noted in §2 and shown in Appendix D, generalized linear models can admit an exactly lazy structure, where the lazy rank $r$ of the posterior is bounded by the number of observations. We demonstrate this by first considering a small subset of 20 observations: given a sufficiently expressive underlying transport class, a single lazy map of rank $r = 20$ can exactly capture the posterior. We compare four transport maps: a baseline IAF map; U-IAF, a 1-layer lazy map with rank $r = d = 500$ expressed in the computed basis $U$; and Ur-IAF and Ur-IAF-500, which are 1-layer lazy maps of rank $r = 20$. The baseline IAF, U-IAF, and Ur-IAF each use autoregressive networks with a hidden dimension equal to the input dimension of the flow ($d = 500$ for the baseline IAF and U-IAF, 20 for Ur-IAF). For Ur-IAF-500 we use a hidden dimension of 500, resulting in a map with approximately the same number of flow parameters as the baseline and U-IAF maps.

Results are summarized in Table 1. We see improved performance for each of the lazy maps compared to the baseline. We also note that Ur-IAF outperforms U-IAF in each metric's median, even though the U-IAF map has far more flow parameters than the Ur-IAF map (4,008,000 vs. 6,720). The Ur-IAF-500 map performs best in each metric; this map has the highest ratio of map parameters to active dimensions. This highlights a key benefit of the lazy framework: the ability to focus the expressiveness of a transport map along the particular subspaces that are important for capturing the posterior.

Figure 2: Leading eigenvalues of the diagnostic matrices $H_\ell^B$ for the G3-IAF map applied to the full-rank logistic regression problem. The spectrum flattens and falls as the approximation to the posterior improves.
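For concreteness, a lazy IAF layer of the kind compared here can be assembled from standard TensorFlow Probability bijectors. The sketch below is our own illustration (not the authors' code), and the exact bijector APIs may differ across TFP versions; it rotates into a computed basis $U$, applies an IAF to the first $r$ coordinates, and leaves the remaining coordinates untouched:

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

def lazy_iaf_layer(U, r, hidden_units):
    """z -> U @ [ IAF(z_1:r) ; z_{r+1:d} ], a rank-r lazy layer in the basis U."""
    d = U.shape[0]
    iaf = tfb.Invert(tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
            params=2, hidden_units=hidden_units)))
    # Apply the flow only to the first r rotated coordinates; identity on the rest.
    blockwise = tfb.Blockwise([iaf, tfb.Identity()], block_sizes=[r, d - r])
    rotate = tfb.ScaleMatvecLinearOperator(
        scale=tf.linalg.LinearOperatorFullMatrix(U.astype(np.float32)))
    return tfb.Chain([rotate, blockwise])          # Chain applies right-to-left

# Hypothetical usage: U would come from the eigenvectors of the estimated diagnostic matrix.
d, r = 500, 20
U = np.linalg.qr(np.random.default_rng(0).normal(size=(d, d)))[0]
bijector = lazy_iaf_layer(U, r, hidden_units=[r, r])
approx_posterior = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(loc=tf.zeros(d)), bijector=bijector)
# approx_posterior.sample() and approx_posterior.log_prob() are now available, and the ELBO
# can be maximized over the AutoregressiveNetwork weights via the reparameterization trick.
```

Stacking several such layers, each with its own basis $U^\ell$, gives a deeply lazy flow of the G3-IAF type described below.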
Next we consider a full-rank Bayesian logistic regression problem using 605 observations. Here we compare a baseline IAF; U-IAF, defined as before; and a 3-layer lazy map trained via the greedy Algorithm 2, denoted G3-IAF, in which each layer has rank $r = 200$. Results are summarized in Table 1, and again we see improvements in each of the performance metrics compared to the baseline IAF. Recall that the basis $U$ relates to a bound on the inclusive KL direction, while the objective function for map training within a layer optimizes the exclusive KL direction; empirically we see benefits in metrics relating to both directions. Interestingly, we observe that U-IAF achieves the greatest ELBO while G3-IAF achieves the lowest trace diagnostics. This suggests that using a larger number of lazy layers tends to lead to improvements in the inclusive KL divergence. Also, although we chose to use the same number of training iterations in each case, we observe that training of the lazy maps converges more quickly; see Appendix G.1 for additional details.

Table 1: Result summaries for the Bayesian logistic regression and Bayesian neural network examples. Values reported are the median and (interquartile range) across 10 trials with randomized initialization. Best performance is bolded. *ELBO reported relative to the median of the baseline.

| Map | ELBO* (↑) | Variance diagnostic (↓) | Tr($H_\ell^B$)/2 (↓) | Tr($H_\ell$)/2 (↓) |
|---|---|---|---|---|
| *Low-rank Bayesian logistic regression* | | | | |
| Baseline IAF | – | 26.5 (3.88) | 104 (9.19) | 31.3 (15.6) |
| U-IAF | 6.72 (0.469) | 7.48 (1.37) | 37.4 (2.83) | 20.2 (7.91) |
| Ur-IAF | 8.27 (0.249) | 5.68 (0.935) | 33.8 (4.29) | 19.1 (7.68) |
| Ur-IAF-500 | **10.9 (0.227)** | **1.66 (0.496)** | **8.89 (6.63)** | **6.19 (0.896)** |
| *Full-rank Bayesian logistic regression* | | | | |
| Baseline IAF | – | 209 (21.4) | 956 (68.6) | 350 (178) |
| U-IAF | **26.4 (1.19)** | 130 (12.3) | 623 (19.5) | 287 (96.1) |
| G3-IAF | 1.68 (1.56) | **109 (8.25)** | **510 (21.6)** | **219 (110)** |
| *Bayesian neural network* | | | | |
| Baseline Affine | – | 1.6e4 (5.8e4) | 3.5e5 (6.9e5) | 960 (1.0e3) |
| G3-Affine | **47.7 (2.33)** | **97.5 (6.47)** | **1.06e3 (56.2)** | **606 (201)** |

As discussed in the introduction, a powerful use case for transport maps is preconditioning an MCMC method, as described in [26, 44, 45], i.e., using the computed map to improve the posterior geometry. Applying Hamiltonian Monte Carlo [41] to the full-rank Bayesian logistic regression problem (in particular, sampling the pullback $\mathfrak{T}_\ell^\sharp\pi$, where $\mathfrak{T}_\ell$ is the learned U-IAF map), we achieve worst, best, and average component-wise effective sample sizes of 0.39%, 1.8%, and 0.99%, compared to 0.056%, 0.12%, and 0.065% without a transport map (sampling the target $\pi$ directly). Note that applying $\mathfrak{T}_\ell$ to MCMC samples from the pullback yields asymptotically exact samples from $\pi$. Three leapfrog steps were used in the HMC proposal, and the step sizes were chosen adaptively during the burn-in period of the chains to obtain acceptance rates between 70% and 90% [3, 5, 6].

## 4.3 Bayesian neural network

We now consider a Bayesian neural network, also studied in [18, 36], trained on the UCI yacht hydrodynamics data set [2]. Our inference problem is 581-dimensional, given a network input dimension of 6, one hidden layer of dimension 20, and an output layer of dimension 1. We use sigmoid activations in the input and hidden layers, and a linear output layer. Model parameters are endowed with independent Gaussian priors with zero mean and variance 100. Further details are in Appendix G.2. Here we consider affine maps as the underlying class of transport; this yields Gaussian approximations to the posterior distribution in both the lazy and baseline cases.
We compare a baseline affine map with G3-Affine, a 3-layer lazy map in which each layer has rank $r = 200$. The diagnostic matrices $H_\ell^B$ are computed using 581 standard normal samples. We note improvements in each of the performance metrics using the lazy framework, summarized in Table 1. We also note a 64% decrease in the number of trained flow parameters in G3-Affine relative to the baseline (from 338,142 to 120,600). As in §4.2, we compare the performance of HMC applied with and without transport-map preconditioning. We achieve worst, best, and average component-wise ESS of 0.073%, 1.2%, and 0.56% using the learned G3-Affine map, compared to 0.047%, 0.14%, and 0.06% without a transport map. Here five leapfrog steps were used in the HMC proposal, and the step sizes in each case were picked adaptively as before.

## 4.4 High-dimensional elliptic PDE inverse problem

We consider the problem of estimating the diffusion coefficient $e^{\kappa(x)}$ of an elliptic PDE from sparse observations of the field $u(x)$ solving

$$\nabla \cdot \big(e^{\kappa(x)}\,\nabla u(x)\big) = 0 \ \text{ for } x \in D := [0,1]^2, \quad u(x) = 0 \ \text{ for } x_1 = 0, \quad u(x) = 1 \ \text{ for } x_1 = 1, \quad \nabla u(x) \cdot n = 0 \ \text{ for } x_2 \in \{0, 1\}. \qquad (6)$$

Figure 3: Application of Algorithm 2 to an elliptic PDE with unknown diffusion coefficient. (a) The data-generating field $\kappa^*$. (b) Convergence of the trace error bound and the variance diagnostic with greedy iterations. (c) Draws from the 2601-dimensional posterior distribution $\pi(\kappa \,|\, y^*)$.

This PDE is discretized using finite elements over a uniform mesh of $51 \times 51$ nodes, leading to $d = 2601$ degrees of freedom. We denote by $\kappa$ the discretized version of the log-diffusion coefficient over this mesh. Let $F$ be the map from the parameter $\kappa$ to the $n = 81$ values of $u$ collected at the locations shown in Figure 3a. Observations follow the model $y = F(\kappa) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma_{\mathrm{obs}})$ and $\Sigma_{\mathrm{obs}} := 10^{-3} I$. The coefficient $\kappa$ is endowed with a Gaussian prior $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is the covariance of an Ornstein–Uhlenbeck process. For the observations $y^*$ associated with the parameter $\kappa^*$ shown in Figure 3a, our target distribution is $\pi(z) \propto \mathcal{L}_{y^*}(z)\,\rho(z)$, where $\kappa = \Sigma^{1/2} z$.

We greedily train a deeply lazy map using Algorithm 2, with triangular polynomial maps as the underlying transport (see Appendix B). Expectations appearing in the algorithm are discretized with $m = 500$ Monte Carlo samples. To avoid wasting work in the early iterations, we use affine maps of rank $r = 4$ for iterations $\ell = 1, \ldots, 5$, and then switch to polynomial maps of degree 2 and rank $r = 2$ for the remaining iterations. This reflects the flexibility of the lazy framework: changes to the underlying transport class and to the lazy rank of each layer are simple to implement. The algorithm is terminated when it stagnates, having exhausted the expressiveness of the underlying transport class and the precision of approximating the objective with $m$ samples; see Figure 3b. Randomly drawn realizations of $\kappa$ in Figure 3c resemble the data-generating field.

This elliptic PDE is a challenging benchmark problem for high-dimensional inference [4, 16, 53]. We note that the final map is a sparse degree-32 polynomial that acts nonlinearly on all 2061 degrees of freedom. Without imposing structure, the curse of dimensionality would render the solution of this problem with polynomial transport maps completely intractable [56]. For instance, a naïve total-degree parameterization of just the final component of the map would contain $\binom{2061+32}{32} \approx 5.5 \times 10^{70}$ parameters.
We can confirm the quality of the posterior approximation, and demonstrate a further application of transport, by using MCMC to sample the pullback $\mathfrak{T}_\ell^\sharp\pi$. We do so using preconditioned Crank–Nicolson (pCN) MCMC [15], a state-of-the-art algorithm for PDE problems with a dimension-independent convergence rate, with a step size parameter $\beta = 0.5$. The acceptance rate is 28.2%, with the worst, best, and average effective sample sizes [58] being 0.2%, 2.6%, and 1.5% of the complete chain. For comparison, a direct application of pCN with the same $\beta$ leads to an acceptance rate under 0.4% and an effective sample size that cannot be reliably computed. More details are in Appendix G.3.

## 5 Conclusions

We have presented a framework for creating target-informed architectures for transport-based variational inference. Our approach uses a rigorous error bound to identify low-dimensional structure in the target distribution and to focus the expressiveness of the transport map or flow on an important subspace. We also introduce and analyze a greedy algorithm for building deep compositions of low-dimensional maps that can iteratively approximate general high-dimensional target distributions. Empirically, these methods improve the accuracy of inference, accelerate training, and control the complexity of flows to improve tractability. Ongoing work will consider constructive tests for further varieties of underlying structure in inference problems, and their implications for the structure of flows.

## Broader Impact

Who may benefit from this research? We believe users and developers of approximate inference methods will benefit from our work. Our framework works as an outer wrapper that can improve the effectiveness of any flow-based variational inference method by guiding its structure. We hope to make expressive flow-based variational inference more tractable, efficient, and broadly applicable, particularly in high dimensions, by developing automated tests for low-dimensional structure and flexible ways to exploit it. The trace diagnostic developed in our work rigorously assesses the quality of transport- or flow-based inference, and may be of independent interest.

Who may be put at a disadvantage by this research? We do not believe anyone is put at a disadvantage by this research.

What are the consequences of failure of the system? We specifically point out that one contribution of this work is identifying when a poor posterior approximation has occurred. A potential failure mode of our framework would be inaccurate estimation of the diagnostic matrix $H$ or its spectrum, suggesting that the approximate posterior is more accurate than it truly is. However, computing the eigenvalues or trace of a symmetric matrix, even one estimated from samples, is a well-studied problem, and numerical software guards against poor eigenvalue estimation, or at least warns when it occurs. We believe the theoretical underpinnings of this work make it robust to undetected failure.

Does the task/method leverage biases in the data? We do not believe our method leverages data bias. As a method for variational inference, our goal is to accurately approximate a posterior distribution. It is certainly possible to encode biases for or against a particular result in a Bayesian inference problem, but that occurs at the level of modeling (choosing the prior, defining the likelihood) and collecting data, not at the level of approximating the posterior.
## Acknowledgments and Disclosure of Funding

This work was supported in part by the US Department of Energy, Office of Advanced Scientific Computing Research, AEOLUS (Advances in Experimental Design, Optimal Control, and Learning for Uncertain Complex Systems) project. The authors also gratefully acknowledge support from the Inria associate team UNQUESTIONABLE.

## References

[1] https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification.
[2] http://archive.ics.uci.edu/ml/datasets/yacht+hydrodynamics.
[3] C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.
[4] A. Beskos, M. Girolami, S. Lan, P. E. Farrell, and A. M. Stuart. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.
[5] A. Beskos, N. Pillai, G. Roberts, J.-M. Sanz-Serna, A. Stuart, et al. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
[6] M. Betancourt, S. Byrne, and M. Girolami. Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1411.6669, 2014.
[7] D. Bigoni. TransportMaps. http://transportmaps.mit.edu/, 2016–2020.
[8] D. Bigoni, A. Spantini, and Y. Marzouk. Adaptive construction of measure transports for Bayesian inference. NIPS Workshop on Approximate Inference, 2016.
[9] V. I. Bogachev, A. V. Kolesnikov, and K. V. Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.
[10] C. G. Broyden. The convergence of a class of double-rank minimization algorithms. Part II. J. Inst. Math. Appl., 6:222, 1970.
[11] G. Carlier, A. Galichon, and F. Santambrogio. From Knothe's transport to Brenier's map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 41(6):2554–2576, 2010.
[12] P. Chen, K. Wu, J. Chen, T. O'Leary-Roseberry, and O. Ghattas. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. arXiv e-prints, Jan. 2019.
[13] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. NeurIPS, 2018.
[14] O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis–Hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):253–268, 2005.
[15] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: Modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
[16] T. Cui, K. J. Law, and Y. M. Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
[17] N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.
[18] G. Detommaso, T. Cui, A. Spantini, Y. Marzouk, and R. Scheichl. A Stein variational Newton method. NeurIPS, 2018.
[19] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.
[20] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv:1605.08803, 2016.
[21] E. Dupont, A. Doucet, and Y. W. Teh. Augmented Neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
[22] P. E. Farrell, D. A. Ham, S. W. Funke, and M. E. Rognes. Automated derivation of the adjoint of high-level transient finite element programs. SIAM Journal on Scientific Computing, 35(4):C369–C393, 2013.
[23] A. Gholaminejad, K. Keutzer, and G. Biros. ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 730–736, 2019.
[24] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[25] G. H. Golub and J. H. Welsch. Calculation of Gauss quadrature rules. Mathematics of Computation, 23(106):221–221, May 1969.
[26] M. Hoffman, P. Sountsov, J. V. Dillon, I. Langmore, D. Tran, and S. Vasudevan. Neutralizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.
[27] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.
[28] P. J. Huber. Projection pursuit. The Annals of Statistics, pages 435–475, 1985.
[29] P. Jaini, K. A. Selby, and Y. Yu. Sum-of-squares polynomial flow. arXiv:1905.02325, 2019.
[30] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[31] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc., 2016.
[32] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[33] H. Knothe. Contributions to the theory of convex bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957.
[34] I. Kobyzev, S. Prince, and M. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[35] Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3115–3123, 2017.
[36] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, 2016.
[37] A. Logg and G. N. Wells. DOLFIN: Automated finite element computing. ACM Transactions on Mathematical Software, 37(2), 2010.
[38] Y. Marzouk, T. Moselhy, M. Parno, and A. Spantini. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, and H. Owhadi, editors. Springer, 2016.
[39] J. Møller, A. R. Syversveen, and R. Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.
[40] T. Moselhy and Y. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
[41] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[42] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
[43] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
[44] M. Parno and Y. M. Marzouk. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018.
[45] B. Peherstorfer and Y. Marzouk. A transport-based multifidelity preconditioner for Markov chain Monte Carlo. Advances in Computational Mathematics, 45(5-6):2321–2348, 2019.
[46] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
[47] M. Rosenblatt. Remarks on a multivariate transformation. The Annals of Mathematical Statistics, pages 470–472, 1952.
[48] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2):319–392, 2009.
[49] C. O. Sakar, G. Serbes, A. Gunduz, H. C. Tunc, H. Nizam, B. E. Sakar, M. Tutuncu, T. Aydin, M. E. Isenkul, and H. Apaydin. A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing, 74:255–263, 2019.
[50] F. Santambrogio. Optimal Transport for Applied Mathematicians, volume 87. Springer, 2015.
[51] S. Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. Dokl. Akad. Nauk SSSR, 1963.
[52] A. Spantini, D. Bigoni, and Y. Marzouk. Inference via low-dimensional couplings. The Journal of Machine Learning Research, 19(1):2639–2709, 2018.
[53] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
[54] E. Tabak and C. V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
[55] V. N. Temlyakov. Greedy approximation. Acta Numerica, 17:235–409, 2008.
[56] L. N. Trefethen. Cubature, approximation, and isotropy in the hypercube. SIAM Review, 59(3):469–491, 2017.
[57] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[58] U. Wolff. Monte Carlo errors with less errors. Computer Physics Communications, 156(2):143–153, 2004.
[59] O. Zahm, T. Cui, K. Law, A. Spantini, and Y. Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.