# Greedy inference with structure-exploiting lazy maps

Michael C. Brennan, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, mcbrenn@mit.edu
Daniele Bigoni, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, dabi@mit.edu
Olivier Zahm, Université Grenoble Alpes, INRIA, CNRS, LJK, 38000 Grenoble, France, olivier.zahm@inria.fr
Alessio Spantini, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, alessio.spantini@gmail.com
Youssef Marzouk, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, ymarz@mit.edu

(These authors contributed equally to this work.)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We propose a framework for solving high-dimensional Bayesian inference problems using structure-exploiting low-dimensional transport maps or flows. These maps are confined to a low-dimensional subspace (hence, lazy), and the subspace is identified by minimizing an upper bound on the Kullback–Leibler divergence (hence, structured). Our framework provides a principled way of identifying and exploiting low-dimensional structure in an inference problem. It focuses the expressiveness of a transport map along the directions of most significant discrepancy from the posterior, and can be used to build deep compositions of lazy maps, where low-dimensional projections of the parameters are iteratively transformed to match the posterior. We prove weak convergence of the generated sequence of distributions to the posterior, and we demonstrate the benefits of the framework on challenging inference problems in machine learning and differential equations, using inverse autoregressive flows and polynomial maps as examples of the underlying density estimators.

## 1 Introduction

Inference in the Bayesian setting typically requires the computation of integrals $\int f \,\mathrm{d}\pi$ over an intractable posterior distribution whose density $\pi$ is known up to a normalizing constant. (In this paper, we only consider distributions that are absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$, and thus use $\pi$ to denote both the distribution and its associated density.) One approach to this problem is to construct a deterministic nonlinear transformation, i.e., a transport map [57], that induces a coupling of $\pi$ with a tractable distribution $\rho$ (e.g., a standard Gaussian). Formally, we seek a map $T$ that pushes forward $\rho$ to $\pi$, written as $T_\sharp\rho = \pi$, such that the change of variables $\int f \,\mathrm{d}\pi = \int f \circ T \,\mathrm{d}\rho$ makes integration tractable.

Many constructions for such maps have been developed in recent years. Normalizing flows (see [34, 42, 46, 54] and references therein) build transport maps via a deep composition of functions parameterized by neural networks, with certain ansatzes to enable efficient computation. Many recently proposed autoregressive flows (for example, [17, 20, 27, 31, 43]) compose triangular maps, which allow for efficient evaluation of Jacobian determinants. In general, triangular maps [9, 33, 47] are sufficiently general to couple any absolutely continuous pair of distributions $(\rho, \pi)$, and their numerical approximations have been investigated in [29, 38, 40, 52]. The flow map of a neural ordinary differential equation [13, 21, 23] can also be seen as an infinite-layer limit of a normalizing flow. Alternatively, Stein variational methods [18, 35, 36] provide a nonparametric way of constructing $T$ as a composition of functions lying in a chosen RKHS.
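To make the pushforward relation concrete, here is a minimal one-dimensional sketch (our own illustration, not from the paper) of how a map $T$ that pushes a standard Gaussian $\rho$ forward to a target $\pi$ turns integration against $\pi$ into integration against $\rho$; it uses only NumPy and SciPy, and the particular map $T(z) = \mu + \sigma z$ is a toy choice:

```python
import numpy as np
from scipy import stats

# Toy 1-D example: T(z) = mu + sigma * z pushes rho = N(0, 1) forward to pi = N(mu, sigma^2).
mu, sigma = 1.5, 0.7
T = lambda z: mu + sigma * z
T_inv = lambda x: (x - mu) / sigma

rho = stats.norm(0.0, 1.0)
pi = stats.norm(mu, sigma)

# Pushforward density: (T_sharp rho)(x) = rho(T^{-1}(x)) * |d T^{-1} / dx|.
x = 0.3
pushforward = rho.pdf(T_inv(x)) * abs(1.0 / sigma)
assert np.isclose(pushforward, pi.pdf(x))

# Change of variables: int f dpi = int (f o T) drho, estimated by Monte Carlo over rho.
f = lambda x: np.sin(x) ** 2
z = rho.rvs(size=200_000, random_state=0)
mc_estimate = f(T(z)).mean()          # E_rho[ f(T(Z)) ]
print(mc_estimate, pi.expect(f))      # the two agree up to Monte Carlo error
```

The same identity is what makes a learned map useful in high dimensions: once $T$ is available, posterior expectations reduce to averages over the tractable base distribution.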
In general, it can be difficult to represent expressive maps in high dimensions. For example, triangular maps on $\mathbb{R}^d$ must describe $d$-variate functions and thus immediately encounter the curse of dimensionality. Similarly, kernel-based methods lose expressiveness in high dimensions [12, 18]. Flow-based methods often increase expressiveness by adding layers, but this is typically performed in an ad hoc or unstructured way, which also requires tuning.

Here we propose a framework for inference that creates target-informed architectures around any class of transport maps or normalizing flows. In particular, our framework uses rigorous a priori error bounds to discover and exploit low-dimensional structure in a given target distribution. It also provides a methodology for efficiently solving high-dimensional inference problems via greedily constructed compositions of structured low-dimensional maps.

The impact of our approach rests on two observations. First, the coordinate basis in which one expresses a transport map (i.e., $T(x)$ versus $U T(x)$, where $U$ is a rotation on $\mathbb{R}^d$) can strongly affect the training behavior and final performance of the method. Our framework identifies an ordered basis that best reveals a certain low-dimensional structure in the problem. Expressing the transport map in this basis focuses the expressiveness of the underlying transport class and allows for principled dimension reduction. This basis is identified by minimizing an upper bound on the Kullback–Leibler (KL) divergence between $\pi$ and its approximation, which follows from logarithmic Sobolev inequalities (see [59]) relating the KL divergence to gradients of the target density. Second, in the spirit of normalizing flows, we seek to increase the expressiveness of a transport map using repeated compositions. Rather than specifying the length of the flow before training, we increase the length of the flow sequentially. For each layer, we apply the framework above to a residual distribution that captures the deviation between the target distribution and its current approximation. We prove weak convergence of this greedy approach to the target distribution under reasonable assumptions. This sequential framework enables efficient layer-wise training of high-dimensional maps, which especially helps control the curse of dimensionality in certain transport classes. As we shall demonstrate empirically, the greedy composition approach can further improve accuracy at the end of training, compared to baseline methods.

Since Markov chain Monte Carlo (MCMC) methods are also a workhorse of inference, it is useful to contrast them with the variational methods discussed above. In general, these two classes of methods have different computational patterns. In variational inference, one might spend considerable effort to construct the approximate posterior, but afterwards enjoys cheap access to samples and normalized evaluations of the (approximate) target density. How well the approximation matches the true posterior depends on the expressiveness of the approximation class and on one's ability to optimize within this class. MCMC, in contrast, requires continual computational effort (even after tuning), but (asymptotically) generates samples from the exact posterior. Yet there is a line of work that uses transport to improve the performance of MCMC methods [26, 44], so that even if one desires exact samples, constructing a transport map can be beneficial. We will demonstrate this link in our numerical experiments.
Preliminaries. We will consider target distributions with densities $\pi$ on $\mathbb{R}^d$ that are differentiable almost everywhere and that can be evaluated up to a normalizing constant. Such a target will often be the posterior of a Bayesian inference problem, e.g., $\pi(x) := p(x\,|\,y) \propto \mathcal{L}_y(x)\,\pi_0(x)$, where $\mathcal{L}_y(x) := p(y\,|\,x)$ is the likelihood function and $\pi_0$ is the prior. We denote the standard Gaussian density on $\mathbb{R}^d$ by $\rho$. We will consider maps $T: \mathbb{R}^d \to \mathbb{R}^d$ that are diffeomorphisms (in general $T$ need not be a diffeomorphism, only a suitably invertible map; see Appendix B for more details. The distributions considered in this paper, however, fulfill the conditions needed for $T$ to be differentiable almost everywhere), and with some abuse of notation, we will write the pushforward density of $\rho$ under $T$ as $T_\sharp\rho(x) := \rho(T^{-1}(x))\,|\det \nabla T^{-1}(x)|$. We will frequently also use the notion of a pullback distribution or density, written as $T^\sharp\pi := (T^{-1})_\sharp\pi$.

In §2 we show how to build a single map in the low-dimensional lazy format described above, and describe the class of posterior distributions that admit such structure. In §3 we develop a greedy algorithm for building deep compositions of lazy maps, which effectively decomposes any inference problem into a series of lower-dimensional problems. §4 presents numerical experiments highlighting the benefits of the lazy framework. While our numerical experiments employ inverse autoregressive flows [31] and polynomial transport maps [29, 40] as the underlying transport classes, we emphasize that the lazy framework is applicable to any class of transport.

## 2 Lazy maps

Given a unitary matrix $U \in \mathbb{R}^{d \times d}$ and an integer $r \le d$, let $\mathcal{T}_r(U)$ be the set that contains all the maps $T: \mathbb{R}^d \to \mathbb{R}^d$ of the form

$$T(z) = U \begin{bmatrix} \tau(z_1, \ldots, z_r) \\ z_\perp \end{bmatrix} = U_r\,\tau(z_1, \ldots, z_r) + U_\perp z_\perp \qquad (1)$$

for some diffeomorphism $\tau: \mathbb{R}^r \to \mathbb{R}^r$. Here $U_r \in \mathbb{R}^{d \times r}$ and $U_\perp \in \mathbb{R}^{d \times (d-r)}$ are the matrices containing, respectively, the first $r$ and the last $d-r$ columns of $U$, and $z_\perp = (z_{r+1}, \ldots, z_d)$. Any map $T \in \mathcal{T}_r(U)$ is called a lazy map with rank bounded by $r$, as it is nonlinear only with respect to the first $r$ input variables $z_1, \ldots, z_r$, and the nonlinearity is contained in the low-dimensional subspace $\mathrm{range}(U_r)$. The next proposition gives a characterization of all the densities $T_\sharp\rho$ when $T \in \mathcal{T}_r(U)$.

Proposition 1 (Characterization of lazy maps). Let $U \in \mathbb{R}^{d \times d}$ be a unitary matrix and let $r \le d$. Then for any lazy map $T \in \mathcal{T}_r(U)$, there exists a strictly positive function $f: \mathbb{R}^r \to \mathbb{R}_{>0}$ such that

$$T_\sharp\rho(x) = f(U_r^\top x)\,\rho(x) \qquad (2)$$

for all $x \in \mathbb{R}^d$, where $\rho$ is the density of the standard normal distribution. Conversely, any probability density function of the form $f(U_r^\top x)\,\rho(x)$ admits a representation as in (2) for some $T \in \mathcal{T}_r(U)$.

The proof is given in Appendix A.1. By Proposition 1, any posterior density $\pi(x) \propto \mathcal{L}_y(x)\,\pi_0(x)$ with standard Gaussian prior $\pi_0 = \rho$ and with likelihood function of the form $\mathcal{L}_y(x) = f(U_r^\top x)$ can be written exactly as $\pi = T_\sharp\rho$ for some lazy map $T \in \mathcal{T}_r(U)$. In particular, posteriors of generalized linear models naturally fall into this class; see Appendix D for more details.

Following [59, Section 2.1], the solution $T^\star \in \mathcal{T}_r(U)$ to

$$D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) = \min_{T \in \mathcal{T}_r(U)} D_{\mathrm{KL}}(\pi \,\|\, T_\sharp\rho)$$

is such that $T^\star_\sharp\rho(x) = f^\star(U_r^\top x)\,\rho(x)$, where $f^\star$ is the conditional expectation

$$f^\star(x_r) = \mathbb{E}\!\left[\frac{\pi(X)}{\rho(X)} \;\Big|\; U_r^\top X = x_r\right], \qquad X \sim \rho.$$

Now that we know the optimal lazy map in $\mathcal{T}_r(U)$, it remains to find a suitable matrix $U$ and rank $r$. In Appendix A.2 we show that

$$D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) = D_{\mathrm{KL}}(\pi \,\|\, \rho) - D_{\mathrm{KL}}\big((U_r^\top)_\sharp\pi \,\|\, \rho_r\big), \qquad (3)$$

where $\rho_r$ is the density of the standard normal distribution on $\mathbb{R}^r$ and $(U_r^\top)_\sharp\pi$ is the density of $U_r^\top X$ with $X \sim \pi$.
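As a concrete illustration of the structure in (1), the following NumPy sketch (our own, not the authors' code) assembles a lazy map from an arbitrary low-dimensional diffeomorphism $\tau$ and evaluates it together with its log-Jacobian-determinant; the particular $\tau$ used at the end is a hypothetical placeholder:

```python
import numpy as np

def make_lazy_map(U, r, tau, tau_log_det_jac):
    """Build T(z) = U_r tau(z_1:r) + U_perp z_perp from an r-dimensional diffeomorphism tau."""
    Ur, Uperp = U[:, :r], U[:, r:]

    def T(z):
        zr, zperp = z[:r], z[r:]
        return Ur @ tau(zr) + Uperp @ zperp

    def log_det_jac(z):
        # U is orthogonal (|det U| = 1) and the map is block-diagonal in the rotated
        # coordinates, so the log-Jacobian-determinant of T equals that of tau alone.
        return tau_log_det_jac(z[:r])

    return T, log_det_jac

# Toy usage: d = 5, r = 2, with a simple invertible tau (hypothetical choice).
rng = np.random.default_rng(0)
d, r = 5, 2
U, _ = np.linalg.qr(rng.normal(size=(d, d)))      # random orthogonal basis
tau = lambda zr: np.array([zr[0] + zr[1] ** 2, 2.0 * zr[1]])
tau_ldj = lambda zr: np.log(2.0)                  # log |det d tau / d zr| = log 2
T, ldj = make_lazy_map(U, r, tau, tau_ldj)
z = rng.normal(size=d)
print(T(z), ldj(z))
```

With this structure, the pushforward density $T_\sharp\rho$ deviates from $\rho$ only through the $r$-dimensional projection $U_r^\top x$, exactly as Proposition 1 states.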
Thus, for fixed $r$, minimizing $D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho)$ over $U$ is the same as finding the most non-Gaussian marginal $(U_r^\top)_\sharp\pi$. Such an optimal $U$ can be difficult to find in practice. The next proposition instead gives a computable bound on $D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho)$, which we will use to construct a $U$ suitable for our algorithm. The proof is given in Appendix A.3.

Proposition 2. Let $(\lambda_i, u_i) \in \mathbb{R}_{\ge 0} \times \mathbb{R}^d$ be the $i$-th eigenpair of the eigenvalue problem $H u_i = \lambda_i u_i$, with eigenvalues ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, where

$$H = \int \Big(\nabla\log\frac{\pi}{\rho}\Big)\Big(\nabla\log\frac{\pi}{\rho}\Big)^{\!\top} \mathrm{d}\pi.$$

Let $U = [u_1, \ldots, u_d] \in \mathbb{R}^{d \times d}$ be the matrix containing the eigenvectors of $H$. Then for any $r \le d$ we have

$$D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) \le \tfrac{1}{2}\,(\lambda_{r+1} + \cdots + \lambda_d). \qquad (4)$$

Proposition 2 suggests constructing $U$ as the matrix of eigenvectors of $H$, and implies that a fast decay in the spectrum of $H$ allows a lazy map with low $r$ to accurately represent the true posterior. Indeed, one can guarantee $D_{\mathrm{KL}}(\pi \,\|\, T^\star_\sharp\rho) < \varepsilon$ by choosing $r$ to be the smallest integer such that the bound in (4), $\tfrac{1}{2}(\lambda_{r+1} + \cdots + \lambda_d)$, is below $\varepsilon$. In practice, since the complexity of representing and training a transport map may strongly depend on $r$, we can bound $r$ by some $r_{\max} \le d$ associated with a computational budget for constructing $T$. This procedure is summarized in Algorithm 1.

The practical implementation of Algorithm 1 relies on the computation of $H$. Direct Monte Carlo estimation of $H$, however, requires generating samples from $\pi$, which is not feasible in practice. Instead one can use an importance sampling estimate,

$$\widehat{H} = \sum_{k=1}^{K} \omega_k \Big(\nabla\log\frac{\pi}{\rho}(X_k)\Big)\Big(\nabla\log\frac{\pi}{\rho}(X_k)\Big)^{\!\top},$$

where $\{X_k\}_{k=1}^K$ are i.i.d. samples from $\rho$ and $\omega_k = \frac{\pi(X_k)/\rho(X_k)}{\sum_{k'=1}^{K} \pi(X_{k'})/\rho(X_{k'})}$ are self-normalized weights. This estimate can have significant variance when $\rho$ is a poor approximation to the target $\pi$ (e.g., in the first stage of the greedy algorithm in §3). In this case it is preferable to impose $\omega_k = 1/K$, which reduces variance but yields a biased estimator of $H$; instead, it is an unbiased estimator of

$$H_B = \int \Big(\nabla\log\frac{\pi}{\rho}\Big)\Big(\nabla\log\frac{\pi}{\rho}\Big)^{\!\top} \mathrm{d}\rho.$$

As shown via the error bounds in [59, Sec. 3.3.2], this matrix still provides useful information regarding the construction of $U$. We consider the differences between the two estimators in Appendix E. Also, since the effective sample size (ESS) of the importance sampling estimate can be computed with little extra cost after collecting samples, one can use this ESS to choose whether to use $\widehat{H}$ or $\widehat{H}_B$. Other variance reduction methods may also be applicable; for example, simplifications or approximations of the expected outer product of score functions yield natural candidates for control variates.

In constructing a lazy map $T$ of the form (1), one needs to identify a map $\tau: \mathbb{R}^r \to \mathbb{R}^r$ such that $T_\sharp\rho$ approximates the posterior. One can use any transport class to parameterize $\tau$; Appendices B and C detail the particular maps used in our numerical experiments. In our setting we can only evaluate $\pi$ up to a normalizing constant, and thus it is expedient to minimize the reverse KL divergence $D_{\mathrm{KL}}(T_\sharp\rho \,\|\, \pi) = D_{\mathrm{KL}}(\rho \,\|\, T^\sharp\pi)$, as is typical in variational Bayesian methods. This can be achieved by maximizing a Monte Carlo or quadrature approximation of $\mathbb{E}_\rho[\log T^\sharp\pi]$, which is equivalent to maximizing the evidence lower bound (ELBO) and using the reparameterization trick [32] to write the expectation over the base distribution $\rho$. Details on the numerical implementation of Algorithm 1 are given in Appendix F. We note that the lazy framework works to control the KL divergence in the inclusive direction, while optimizing the ELBO minimizes the KL divergence in the exclusive direction.
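The estimators above are straightforward to implement. The sketch below is our own illustration (not the authors' implementation), assuming a function `grad_log_ratio(x)` returning $\nabla\log\frac{\pi}{\rho}(x)$ is available (e.g., via automatic differentiation; for standard normal $\rho$ this is $\nabla\log\pi(x) + x$). It forms the uniform-weight estimate of $H_B$ from standard-normal draws and picks the basis $U$ and rank $r$ as in Algorithm 1:

```python
import numpy as np

def estimate_HB(grad_log_ratio, d, n_samples=500, rng=None):
    """Monte Carlo estimate of H_B = E_rho[ s(X) s(X)^T ], with s(x) = grad log(pi/rho)(x)."""
    rng = rng or np.random.default_rng(0)
    X = rng.normal(size=(n_samples, d))            # i.i.d. samples from rho = N(0, I_d)
    S = np.stack([grad_log_ratio(x) for x in X])   # rows are s(X_k)
    return S.T @ S / n_samples                     # uniform weights omega_k = 1/K

def choose_basis_and_rank(H, eps, r_max):
    """Eigendecompose H and pick the smallest rank whose tail sum satisfies the bound (4)."""
    lam, V = np.linalg.eigh(H)                     # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]                 # reorder to descending
    tails = 0.5 * (lam.sum() - np.cumsum(lam))     # 0.5 * (lam_{r+1} + ... + lam_d) for r = 1..d
    r = next((i + 1 for i, t in enumerate(tails) if t <= eps), len(lam))
    return V, min(r, r_max)                        # U = eigenvector basis, r capped by budget
```

The self-normalized variant of the estimator simply replaces the uniform average by weights proportional to $\pi(X_k)/\rho(X_k)$; the ESS of those weights can then guide the choice between $\widehat{H}$ and $\widehat{H}_B$, as described above.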
We show empirically that this computational strategy provides performance improvements in both directions of the KL divergence between the true and approximate posterior, compared to a baseline that does not utilize the lazy framework.

Algorithm 1: Construction of a lazy map.
1. procedure LazyMap($\pi$, $\rho$, $\varepsilon$, $r_{\max}$)
2. Compute $H = \int (\nabla\log\frac{\pi}{\rho})(\nabla\log\frac{\pi}{\rho})^\top \mathrm{d}\pi$
3. Solve the eigenvalue problem $H u_i = \lambda_i u_i$
4. Let $r = \min\{\, r_{\max},\ \min\{ r \le d : \tfrac{1}{2}\sum_{i>r}\lambda_i \le \varepsilon \} \,\}$ and assemble $U = [u_1, \ldots, u_d]$
5. Find $T$ by solving $\min_{T \in \mathcal{T}_r(U)} D_{\mathrm{KL}}(T_\sharp\rho \,\|\, \pi)$
6. return the lazy map $T$
7. end procedure

## 3 Deeply lazy maps

The restriction $r \le r_{\max}$ in Algorithm 1 helps control the computational cost of constructing the lazy map, but unless a problem admits sufficient lazy structure, $T_\sharp\rho$ may not adequately approximate the posterior. To extend the numerical benefits of the lazy framework to general problems, we consider the deeply lazy map $\mathfrak{T}_\ell$, a composition of $\ell$ lazy maps,

$$\mathfrak{T}_\ell = T_1 \circ \cdots \circ T_\ell, \qquad T_k \in \mathcal{T}_r(U^k),$$

where each $T_k$ is a lazy map associated with a different unitary matrix $U^k \in \mathbb{R}^{d \times d}$. For simplicity we consider the case where each lazy layer $T_k$ has the same rank $r$, though it is trivial to allow the ranks to vary from layer to layer. In general, the composition of lazy maps is not itself a lazy map. For example, there exist $U^1 \ne U^2$ such that $\mathfrak{T}_2 = T_1 \circ T_2$ can depend nonlinearly on each input variable, so that $\mathfrak{T}_2$ cannot be written as in (1).

The diagnostic matrix $H$ allows us to build deeply lazy maps in a greedy way. After $\ell - 1$ iterations, the composition of maps $\mathfrak{T}_{\ell-1} = T_1 \circ \cdots \circ T_{\ell-1}$ has been constructed. We seek a unitary matrix $U^\ell \in \mathbb{R}^{d \times d}$ and a lazy map $T_\ell \in \mathcal{T}_r(U^\ell)$ such that $(\mathfrak{T}_{\ell-1} \circ T_\ell)_\sharp\rho$ best improves over $(\mathfrak{T}_{\ell-1})_\sharp\rho$ as an approximation to the posterior. To this end, we define the residual distribution $\pi_{\ell-1} = (\mathfrak{T}_{\ell-1})^\sharp\pi$, i.e., the pullback of $\pi$ through the current transport map $\mathfrak{T}_{\ell-1}$. Note that

$$D_{\mathrm{KL}}\big(\pi \,\|\, (\mathfrak{T}_{\ell-1} \circ T_\ell)_\sharp\rho\big) = D_{\mathrm{KL}}\big(\pi_{\ell-1} \,\|\, (T_\ell)_\sharp\rho\big).$$

We thus build $T_\ell$ using Algorithm 1, replacing the posterior $\pi$ by the residual distribution $\pi_{\ell-1}$. We then update the transport map to $\mathfrak{T}_\ell = \mathfrak{T}_{\ell-1} \circ T_\ell$ and the residual density to $\pi_\ell = (\mathfrak{T}_\ell)^\sharp\pi$. We note that applying Proposition 2 to $\pi_\ell$ with $r = 0$ yields

$$D_{\mathrm{KL}}\big(\pi \,\|\, (\mathfrak{T}_\ell)_\sharp\rho\big) = D_{\mathrm{KL}}(\pi_\ell \,\|\, \rho) \le \tfrac{1}{2}\,(\lambda_1 + \cdots + \lambda_d) = \tfrac{1}{2}\,\mathrm{Tr}(H_\ell),$$

where we define the diagnostic matrix at iteration $\ell$ as

$$H_\ell = \int \Big(\nabla\log\frac{\pi_\ell}{\rho}\Big)\Big(\nabla\log\frac{\pi_\ell}{\rho}\Big)^{\!\top} \mathrm{d}\pi_\ell.$$

Our framework thus naturally exposes the error bound $\tfrac{1}{2}\mathrm{Tr}(H_\ell)$ on the forward KL divergence, which is of independent interest and applicable to any flow-based method. We refer to this bound as the trace diagnostic. This bound can also be used as a stopping criterion for the greedy algorithm: one can continue adding layers until the bound falls below some desired threshold. This construction is summarized in Algorithm 2, and details on its numerical implementation are given in Appendix F.

Algorithm 2: Construction of a deeply lazy map.
1. procedure LayersOfLazyMaps($\pi$, $\rho$, $\varepsilon$, $r$, $\ell_{\max}$)
2. Set $\pi_0 = \pi$ and $\ell = 0$
3. while $\ell \le \ell_{\max}$ and $\tfrac{1}{2}\mathrm{Tr}(H_\ell) \ge \varepsilon$ do
4. &nbsp;&nbsp;&nbsp;$\ell \leftarrow \ell + 1$
5. &nbsp;&nbsp;&nbsp;Compute $T_\ell = \mathrm{LazyMap}(\pi_{\ell-1}, \rho, 0, r)$ (Algorithm 1)
6. &nbsp;&nbsp;&nbsp;Update $\mathfrak{T}_\ell = \mathfrak{T}_{\ell-1} \circ T_\ell$
7. &nbsp;&nbsp;&nbsp;Compute $\pi_\ell = (\mathfrak{T}_\ell)^\sharp\pi$
8. &nbsp;&nbsp;&nbsp;Compute $H_\ell = \int (\nabla\log\frac{\pi_\ell}{\rho})(\nabla\log\frac{\pi_\ell}{\rho})^\top \mathrm{d}\pi_\ell$
9. end while
10. return $\mathfrak{T}_\ell = T_1 \circ \cdots \circ T_\ell$
11. end procedure

The next proposition gives a sufficient condition on $U^\ell$ to guarantee the convergence of our greedy algorithm. The proof is given in Appendix A.4.

Proposition 3. Let $U^1, U^2, \ldots$ be a sequence of unitary matrices. For any $\ell \ge 1$, let $T_\ell \in \mathcal{T}_r(U^\ell)$ be a lazy map that minimizes $D_{\mathrm{KL}}(\pi_{\ell-1} \,\|\, (T_\ell)_\sharp\rho)$, where $\pi_{\ell-1} = (T_1 \circ \cdots \circ T_{\ell-1})^\sharp\pi$. If there exists $0 < t \le 1$ such that for any $\ell \ge 1$

$$D_{\mathrm{KL}}\big((U_r^{\ell\,\top})_\sharp\pi_{\ell-1} \,\|\, \rho_r\big) \ge t \sup_{U \in \mathbb{R}^{d \times d}:\ U U^\top = I_d} D_{\mathrm{KL}}\big((U_r^\top)_\sharp\pi_{\ell-1} \,\|\, \rho_r\big), \qquad (5)$$

then $(T_1 \circ \cdots \circ T_\ell)_\sharp\rho$ converges weakly to $\pi$.
Figure 1: Convergence of the algorithm for the approximation of the rotated banana distribution. (a) Decay of the bound $\tfrac{1}{2}\mathrm{Tr}(H_\ell^B)$ on the KL divergence $D_{\mathrm{KL}}(\pi \,\|\, (\mathfrak{T}_\ell)_\sharp\rho)$ and of the variance diagnostic $\tfrac{1}{2}\mathbb{V}_\rho[\log \rho/\mathfrak{T}_\ell^\sharp\pi]$, versus lazy iteration. (b) The target density $\pi$. (c–g) The target distribution is progressively Gaussianized by the maps $\mathfrak{T}_\ell$ (pullbacks $\mathfrak{T}_\ell^\sharp\pi$ for $\ell = 1, 2, 3, 5, 8$).

Let us comment on the condition (5). Recall that the unitary matrix $U$ that maximizes $D_{\mathrm{KL}}((U_r^\top)_\sharp\pi_{\ell-1} \,\|\, \rho_r)$ is optimal; see (3). By (5), the case $t = 1$ means that $U^\ell$ is optimal at each iteration; this corresponds to an ideal greedy algorithm. The case $0 < t < 1$ allows suboptimal choices of $U^\ell$ without losing the convergence property of the algorithm. Such a greedy algorithm converges even with a potentially crude selection of $U^\ell$ corresponding to a $t$ close to zero. This is also why an approximation of $H_\ell$ is expected to be sufficient; see Section 4. We emphasize that condition (5) must apply simultaneously to all layers for a given $0 < t \le 1$. Following [55], one could relax this condition by replacing $t$ with a sequence $(t_\ell)$ that goes to zero sufficiently slowly; this development is left for future work. Finally, note that Proposition 3 does not require any constraints on $r$, so we have convergence even with $r = 1$, where each layer acts on only a single direction at a time.

## 4 Numerical examples

We present numerical demonstrations of the lazy framework as follows. We first illustrate Algorithm 2 on a 2-dimensional toy example, where we show the progressive Gaussianization of the posterior using a sequence of 1-dimensional lazy maps. We then demonstrate the benefits of the lazy framework (Algorithms 1 and 2) on several challenging inference problems. We consider Bayesian logistic regression and a Bayesian neural network, and compare the performance of a baseline transport map to lazy maps using the same underlying transport class. We measure performance improvements in four ways: (1) the final ELBO achieved by the transport maps after training; (2 and 3) the final trace diagnostics $\tfrac{1}{2}\mathrm{Tr}(H_\ell^B)$ and $\tfrac{1}{2}\mathrm{Tr}(H_\ell)$, which bound the error $D_{\mathrm{KL}}(\pi \,\|\, (\mathfrak{T}_\ell)_\sharp\rho)$; and (4) the variance diagnostic $\tfrac{1}{2}\mathbb{V}_\rho[\log \rho/\mathfrak{T}_\ell^\sharp\pi]$, which is an asymptotic approximation of $D_{\mathrm{KL}}((\mathfrak{T}_\ell)_\sharp\rho \,\|\, \pi)$ as $(\mathfrak{T}_\ell)_\sharp\rho \to \pi$ (see [40]). Finally, we highlight the advantages of greedily training lazy maps in a nonlinear problem defined by a high-dimensional elliptic partial differential equation (PDE), often used for testing high-dimensional inference methods [4, 16, 53]. Here, the lazy framework is needed to make variational inference tractable by controlling the total number of map parameters. We also illustrate the utility of such flows in preconditioning Markov chain Monte Carlo (MCMC) samplers [26, 44], or equivalently as a way of de-biasing the variational approximation, on these three problems.

Numerical examples are implemented both in the TransportMaps framework [7] and using the TensorFlow Probability library [19]. The PDE considered in §4.4 is discretized and solved using the FEniCS [37] and dolfin-adjoint [22] packages. (Code for the numerical examples can be found at https://github.com/MichaelCBrennan/lazymaps and http://bit.ly/2QlelXF. Data for §4.4, G.4, and G.5 can be downloaded at http://bit.ly/2X09Ns8, http://bit.ly/2HytQc0 and http://bit.ly/2Eug5ZR.)
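Before turning to the experiments, here is a schematic Python sketch of the greedy construction in Algorithm 2. It is our own illustration of the control flow, not the authors' implementation: the helper routines `estimate_H`, `fit_lazy_layer`, and `pullback` are hypothetical placeholders standing in for the estimator of §2 and for whatever transport class is used within each layer.

```python
import numpy as np

def deeply_lazy(log_pi, grad_log_pi, r, eps, ell_max,
                estimate_H, fit_lazy_layer, pullback):
    """Schematic greedy construction of a deeply lazy map (cf. Algorithm 2).

    estimate_H(log_p, grad_log_p)          -> (d, d) diagnostic matrix for the current residual
    fit_lazy_layer(log_p, grad_log_p, U, r) -> a rank-r lazy layer in the basis U (ELBO training)
    pullback(log_p, grad_log_p, T)          -> log-density and gradient of T^sharp p
    """
    layers = []
    log_res, grad_res = log_pi, grad_log_pi          # residual pi_0 = pi
    H = estimate_H(log_res, grad_res)
    while len(layers) < ell_max and 0.5 * np.trace(H) > eps:   # trace-diagnostic stopping rule
        lam, U = np.linalg.eigh(H)
        U = U[:, ::-1]                               # eigenvectors, descending eigenvalues
        T_ell = fit_lazy_layer(log_res, grad_res, U, r)
        layers.append(T_ell)
        # pi_ell = T_ell^sharp pi_{ell-1}, i.e. the pullback of the previous residual.
        log_res, grad_res = pullback(log_res, grad_res, T_ell)
        H = estimate_H(log_res, grad_res)
    return layers                                    # the deep map is T_1 o ... o T_ell
```

The loop makes explicit why the method is "greedy": each layer sees only the residual left by its predecessors, and the trace diagnostic both orders the basis for the next layer and decides when to stop.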
## 4.1 Illustrative toy example

We first apply the algorithm to the standard problem of approximating the rotated banana distribution $Q_\sharp\pi_{X_1,X_2}$, defined by $X_1 \sim \mathcal{N}(0.5, 0.8)$ and $X_2 \,|\, X_1 \sim \mathcal{N}(X_1^2, 0.2)$, where $Q$ is a random rotation. We restrict ourselves to a composition of rank-1 lazy maps and use degree-3 polynomial maps as the underlying transport class. We use Gauss quadrature rules of order 10 for the discretization of the KL divergence and the approximation of $H_\ell^B$ ($m = 121$ in Algorithms 3 and 5). Figure 1b shows the target distribution $\pi := Q_\sharp\pi_{X_1,X_2}$. Figure 1a shows the convergence of the algorithm both in terms of the trace diagnostic $\tfrac{1}{2}\mathrm{Tr}(H_\ell^B)$ and in terms of the variance diagnostic. After two iterations the algorithm has explored all directions of $\mathbb{R}^2$, leading to a fast improvement. The convergence stagnates once the trade-off between the complexity of the underlying transport class and the accuracy of the quadrature has been saturated. Figures 1c–g show the progressive Gaussianization of the residual distributions $\mathfrak{T}_\ell^\sharp\pi$ for different iterations $\ell$.

## 4.2 Bayesian logistic regression

We now consider a high-dimensional Bayesian logistic regression problem using the UCI Parkinson's disease classification data [1], studied in [49]. We use the first 500 provided attributes, consisting mainly of features extracted from patient speech recordings. This results in a $d = 500$ dimensional inference problem. We choose a relatively uninformative prior of $\mathcal{N}(0, 10^2 I_d)$. Here we consider inverse autoregressive flows (IAFs) [31] as the underlying transport class. Details on the IAF structure, our choice of hyperparameters, and the training procedure are in Appendix C.

As noted in §2 and shown in Appendix D, generalized linear models can admit an exactly lazy structure, where the lazy rank $r$ of the posterior is bounded by the number of observations. We demonstrate this by first considering a small subset of 20 observations: given a sufficiently expressive underlying transport class, a single lazy map of rank $r = 20$ can exactly capture the posterior. We compare four transport maps: a baseline IAF map; U-IAF, a 1-layer lazy map with rank $r = d = 500$ expressed in the computed basis $U$; and Ur-IAF and Ur-IAF-500, which are 1-layer lazy maps of rank $r = 20$. The baseline IAF, U-IAF, and Ur-IAF each use autoregressive networks with a hidden dimension equal to the input dimension of the flow ($d = 500$ for the baseline IAF and U-IAF, 20 for Ur-IAF). For Ur-IAF-500 we use a hidden dimension of 500, resulting in a map with approximately the same number of flow parameters as the baseline and U-IAF maps.

Results are summarized in Table 1. We see improved performance for each of the lazy maps compared to the baseline. We also note that Ur-IAF outperforms U-IAF in each metric's median, even though the U-IAF map has far more flow parameters than the Ur-IAF map (4,008,000 vs. 6,720). The Ur-IAF-500 map performs best in each metric; this map has the highest ratio of map parameters to active dimensions. This highlights a key benefit of the lazy framework: the ability to focus the expressiveness of a transport map along the particular subspaces that are important for capturing the posterior.

Figure 2: Leading eigenvalues of the diagnostic matrices $H_\ell^B$ for the G3-IAF map applied to the full-rank logistic regression problem. The spectrum flattens and falls as the approximation to the posterior improves.
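For concreteness, a lazy IAF layer of the kind compared here can be assembled from standard TensorFlow Probability bijectors. The sketch below is our own illustration (not the authors' code), and the exact bijector APIs may differ across TFP versions; it rotates into a computed basis $U$, applies an IAF to the first $r$ coordinates, and leaves the remaining coordinates untouched:

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

def lazy_iaf_layer(U, r, hidden_units):
    """z -> U @ [ IAF(z_1:r) ; z_{r+1:d} ], a rank-r lazy layer in the basis U."""
    d = U.shape[0]
    iaf = tfb.Invert(tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
            params=2, hidden_units=hidden_units)))
    # Apply the flow only to the first r rotated coordinates; identity on the rest.
    blockwise = tfb.Blockwise([iaf, tfb.Identity()], block_sizes=[r, d - r])
    rotate = tfb.ScaleMatvecLinearOperator(
        scale=tf.linalg.LinearOperatorFullMatrix(U.astype(np.float32)))
    return tfb.Chain([rotate, blockwise])          # Chain applies right-to-left

# Hypothetical usage: U would come from the eigenvectors of the estimated diagnostic matrix.
d, r = 500, 20
U = np.linalg.qr(np.random.default_rng(0).normal(size=(d, d)))[0]
bijector = lazy_iaf_layer(U, r, hidden_units=[r, r])
approx_posterior = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(loc=tf.zeros(d)), bijector=bijector)
# approx_posterior.sample() and approx_posterior.log_prob() are now available, and the ELBO
# can be maximized over the AutoregressiveNetwork weights via the reparameterization trick.
```

Stacking several such layers, each with its own basis $U^\ell$, gives a deeply lazy flow of the G3-IAF type described below.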
Next we consider a full-rank Bayesian logistic regression problem using 605 observations. Here we compare a baseline IAF; U-IAF, defined as before; and a 3-layer lazy map trained via the greedy Algorithm 2, denoted G3-IAF, in which each layer has rank $r = 200$. Results are summarized in Table 1, and again we see improvements in each of the performance metrics compared to the baseline IAF. Recall that the basis $U$ relates to a bound on the inclusive KL direction, while the objective function for map training within a layer optimizes the exclusive KL direction; empirically we see benefits in metrics relating to both directions. Interestingly, we observe that U-IAF achieves the greatest ELBO while G3-IAF achieves the lowest trace diagnostics. This suggests that using a larger number of lazy layers tends to lead to improvements in the inclusive KL divergence. Also, although we chose to use the same number of training iterations in each case, we observe that training of the lazy maps converges more quickly; see Appendix G.1 for additional details.

Table 1: Result summaries for the Bayesian logistic regression and Bayesian neural network examples. Values reported are the median and (interquartile range) across 10 trials with randomized initialization. Best performance is bolded. *ELBO reported relative to the median of the baseline.

| Map | ELBO* (↑) | Variance diagnostic (↓) | Tr($H_\ell^B$)/2 (↓) | Tr($H_\ell$)/2 (↓) |
|---|---|---|---|---|
| *Low-rank Bayesian logistic regression* | | | | |
| Baseline IAF | – | 26.5 (3.88) | 104 (9.19) | 31.3 (15.6) |
| U-IAF | 6.72 (0.469) | 7.48 (1.37) | 37.4 (2.83) | 20.2 (7.91) |
| Ur-IAF | 8.27 (0.249) | 5.68 (0.935) | 33.8 (4.29) | 19.1 (7.68) |
| Ur-IAF-500 | **10.9 (0.227)** | **1.66 (0.496)** | **8.89 (6.63)** | **6.19 (0.896)** |
| *Full-rank Bayesian logistic regression* | | | | |
| Baseline IAF | – | 209 (21.4) | 956 (68.6) | 350 (178) |
| U-IAF | **26.4 (1.19)** | 130 (12.3) | 623 (19.5) | 287 (96.1) |
| G3-IAF | 1.68 (1.56) | **109 (8.25)** | **510 (21.6)** | **219 (110)** |
| *Bayesian neural network* | | | | |
| Baseline Affine | – | 1.6e4 (5.8e4) | 3.5e5 (6.9e5) | 960 (1.0e3) |
| G3-Affine | **47.7 (2.33)** | **97.5 (6.47)** | **1.06e3 (56.2)** | **606 (201)** |

As discussed in the introduction, a powerful use case for transport maps is preconditioning an MCMC method, as described in [26, 44, 45], i.e., using the computed map to improve the posterior geometry. Applying Hamiltonian Monte Carlo [41] to the full-rank Bayesian logistic regression problem (in particular, sampling the pullback $\mathfrak{T}_\ell^\sharp\pi$, where $\mathfrak{T}_\ell$ is the learned U-IAF map), we achieve worst, best, and average component-wise effective sample sizes of 0.39%, 1.8%, and 0.99%, compared to 0.056%, 0.12%, and 0.065% without a transport map (sampling the target $\pi$ directly). Note that applying $\mathfrak{T}_\ell$ to MCMC samples from the pullback yields asymptotically exact samples from $\pi$. Three leapfrog steps were used in the HMC proposal, and the step sizes were chosen adaptively during the burn-in period of the chains to obtain acceptance rates between 70% and 90% [3, 5, 6].

## 4.3 Bayesian neural network

We now consider a Bayesian neural network, also studied in [18, 36], trained on the UCI yacht hydrodynamics data set [2]. Our inference problem is 581-dimensional, given a network input dimension of 6, one hidden layer of dimension 20, and an output layer of dimension 1. We use sigmoid activations in the input and hidden layers, and a linear output layer. Model parameters are endowed with independent Gaussian priors with zero mean and variance 100. Further details are in Appendix G.2. Here we consider affine maps as the underlying class of transport; this yields Gaussian approximations to the posterior distribution in both the lazy and baseline cases.
We compare a baseline affine map with G3-Affine, a 3-layer lazy map in which each layer has rank $r = 200$. The diagnostic matrices $H_\ell^B$ are computed using 581 standard normal samples. We note improvements in each of the performance metrics using the lazy framework, summarized in Table 1. We also note a 64% decrease in the number of trained flow parameters in G3-Affine relative to the baseline (from 338,142 to 120,600). As in §4.2, we compare the performance of HMC applied with and without transport-map preconditioning. We achieve worst, best, and average component-wise ESS of 0.073%, 1.2%, and 0.56% using the learned G3-Affine map, compared to 0.047%, 0.14%, and 0.06% without a transport map. Here five leapfrog steps were used in the HMC proposal, and the step sizes in each case were picked adaptively as before.

## 4.4 High-dimensional elliptic PDE inverse problem

We consider the problem of estimating the diffusion coefficient $e^{\kappa(x)}$ of an elliptic PDE from sparse observations of the field $u(x)$ solving

$$\nabla \cdot \big(e^{\kappa(x)}\,\nabla u(x)\big) = 0 \ \text{ for } x \in D := [0,1]^2, \quad u(x) = 0 \ \text{ for } x_1 = 0, \quad u(x) = 1 \ \text{ for } x_1 = 1, \quad \nabla u(x) \cdot n = 0 \ \text{ for } x_2 \in \{0, 1\}. \qquad (6)$$

Figure 3: Application of Algorithm 2 to an elliptic PDE with unknown diffusion coefficient. (a) The data-generating field $\kappa^*$. (b) Convergence of the trace error bound and the variance diagnostic with greedy iterations. (c) Draws from the 2601-dimensional posterior distribution $\pi(\kappa \,|\, y^*)$.

This PDE is discretized using finite elements over a uniform mesh of $51 \times 51$ nodes, leading to $d = 2601$ degrees of freedom. We denote by $\kappa$ the discretized version of the log-diffusion coefficient over this mesh. Let $F$ be the map from the parameter $\kappa$ to the $n = 81$ values of $u$ collected at the locations shown in Figure 3a. Observations follow the model $y = F(\kappa) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma_{\mathrm{obs}})$ and $\Sigma_{\mathrm{obs}} := 10^{-3} I$. The coefficient $\kappa$ is endowed with a Gaussian prior $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is the covariance of an Ornstein–Uhlenbeck process. For the observations $y^*$ associated with the parameter $\kappa^*$ shown in Figure 3a, our target distribution is $\pi(z) \propto \mathcal{L}_{y^*}(z)\,\rho(z)$, where $\kappa = \Sigma^{1/2} z$.

We greedily train a deeply lazy map using Algorithm 2, with triangular polynomial maps as the underlying transport (see Appendix B). Expectations appearing in the algorithm are discretized with $m = 500$ Monte Carlo samples. To avoid wasting work in the early iterations, we use affine maps of rank $r = 4$ for iterations $\ell = 1, \ldots, 5$, and then switch to polynomial maps of degree 2 and rank $r = 2$ for the remaining iterations. This reflects the flexibility of the lazy framework: changes to the underlying transport class and to the lazy rank of each layer are simple to implement. The algorithm is terminated when it stagnates, having exhausted the expressiveness of the underlying transport class and the precision of approximating the objective with $m$ samples; see Figure 3b. Randomly drawn realizations of $\kappa$ in Figure 3c resemble the data-generating field.

This elliptic PDE is a challenging benchmark problem for high-dimensional inference [4, 16, 53]. We note that the final map is a sparse degree-32 polynomial that acts nonlinearly on all 2061 degrees of freedom. Without imposing structure, the curse of dimensionality would render the solution of this problem with polynomial transport maps completely intractable [56]. For instance, a naïve total-degree parameterization of just the final component of the map would contain $\binom{2061+32}{32} \approx 5.5 \times 10^{70}$ parameters.
We can confirm the quality of the posterior approximation, and demonstrate a further application of transport, by using MCMC to sample the pullback $\mathfrak{T}_\ell^\sharp\pi$. We do so using preconditioned Crank–Nicolson (pCN) MCMC [15], a state-of-the-art algorithm for PDE problems with a dimension-independent convergence rate, with a step size parameter $\beta = 0.5$. The acceptance rate is 28.2%, with the worst, best, and average effective sample sizes [58] being 0.2%, 2.6%, and 1.5% of the complete chain. For comparison, a direct application of pCN with the same $\beta$ leads to an acceptance rate under 0.4% and an effective sample size that cannot be reliably computed. More details are in Appendix G.3.

## 5 Conclusions

We have presented a framework for creating target-informed architectures for transport-based variational inference. Our approach uses a rigorous error bound to identify low-dimensional structure in the target distribution and to focus the expressiveness of the transport map or flow on an important subspace. We also introduce and analyze a greedy algorithm for building deep compositions of low-dimensional maps that can iteratively approximate general high-dimensional target distributions. Empirically, these methods improve the accuracy of inference, accelerate training, and control the complexity of flows to improve tractability. Ongoing work will consider constructive tests for further varieties of underlying structure in inference problems, and their implications for the structure of flows.

## Broader Impact

Who may benefit from this research? We believe users and developers of approximate inference methods will benefit from our work. Our framework works as an outer wrapper that can improve the effectiveness of any flow-based variational inference method by guiding its structure. We hope to make expressive flow-based variational inference more tractable, efficient, and broadly applicable, particularly in high dimensions, by developing automated tests for low-dimensional structure and flexible ways to exploit it. The trace diagnostic developed in our work rigorously assesses the quality of transport- or flow-based inference, and may be of independent interest.

Who may be put at a disadvantage by this research? We do not believe anyone is put at a disadvantage by this research.

What are the consequences of failure of the system? We specifically point out that one contribution of this work is identifying when a poor posterior approximation has occurred. A potential failure mode of our framework would be inaccurate estimation of the diagnostic matrix $H$ or its spectrum, suggesting that the approximate posterior is more accurate than it truly is. However, computing the eigenvalues or trace of a symmetric matrix, even one estimated from samples, is a well-studied problem, and numerical software guards against poor eigenvalue estimation, or at least warns when it occurs. We believe the theoretical underpinnings of this work make it robust to undetected failure.

Does the task/method leverage biases in the data? We do not believe our method leverages data bias. As a method for variational inference, our goal is to accurately approximate a posterior distribution. It is certainly possible to encode biases for or against a particular result in a Bayesian inference problem, but that occurs at the level of modeling (choosing the prior, defining the likelihood) and collecting data, not at the level of approximating the posterior.
## Acknowledgments and Disclosure of Funding

This work was supported in part by the US Department of Energy, Office of Advanced Scientific Computing Research, AEOLUS (Advances in Experimental Design, Optimal Control, and Learning for Uncertain Complex Systems) project. The authors also gratefully acknowledge support from the Inria associate team UNQUESTIONABLE.

## References

[1] https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification.
[2] http://archive.ics.uci.edu/ml/datasets/yacht+hydrodynamics.
[3] C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.
[4] A. Beskos, M. Girolami, S. Lan, P. E. Farrell, and A. M. Stuart. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.
[5] A. Beskos, N. Pillai, G. Roberts, J.-M. Sanz-Serna, A. Stuart, et al. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
[6] M. Betancourt, S. Byrne, and M. Girolami. Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1411.6669, 2014.
[7] D. Bigoni. TransportMaps. http://transportmaps.mit.edu/, 2016–2020.
[8] D. Bigoni, A. Spantini, and Y. Marzouk. Adaptive construction of measure transports for Bayesian inference. NIPS Workshop on Approximate Inference, 2016.
[9] V. I. Bogachev, A. V. Kolesnikov, and K. V. Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.
[10] C. G. Broyden. The convergence of a class of double-rank minimization algorithms. Part II. J. Inst. Math. Appl., 6:222, 1970.
[11] G. Carlier, A. Galichon, and F. Santambrogio. From Knothe's transport to Brenier's map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 41(6):2554–2576, 2010.
[12] P. Chen, K. Wu, J. Chen, T. O'Leary-Roseberry, and O. Ghattas. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. arXiv e-prints, Jan. 2019.
[13] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. NeurIPS, 2018.
[14] O. F. Christensen, G. O. Roberts, and J. S. Rosenthal. Scaling limits for the transient phase of local Metropolis–Hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):253–268, 2005.
[15] S. L. Cotter, G. O. Roberts, A. M. Stuart, and D. White. MCMC methods for functions: Modifying old algorithms to make them faster. Statistical Science, 28(3):424–446, 2013.
[16] T. Cui, K. J. Law, and Y. M. Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.
[17] N. De Cao, I. Titov, and W. Aziz. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.
[18] G. Detommaso, T. Cui, A. Spantini, Y. Marzouk, and R. Scheichl. A Stein variational Newton method. NeurIPS, 2018.
[19] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.
[20] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. arXiv:1605.08803, 2016.
[21] E. Dupont, A. Doucet, and Y. W. Teh. Augmented Neural ODEs. arXiv preprint arXiv:1904.01681, 2019.
[22] P. E. Farrell, D. A. Ham, S. W. Funke, and M. E. Rognes. Automated derivation of the adjoint of high-level transient finite element programs. SIAM Journal on Scientific Computing, 35(4):C369–C393, 2013.
[23] A. Gholaminejad, K. Keutzer, and G. Biros. ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 730–736, 2019.
[24] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[25] G. H. Golub and J. H. Welsch. Calculation of Gauss quadrature rules. Mathematics of Computation, 23(106):221–221, May 1969.
[26] M. Hoffman, P. Sountsov, J. V. Dillon, I. Langmore, D. Tran, and S. Vasudevan. Neutralizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704, 2019.
[27] C.-W. Huang, D. Krueger, A. Lacoste, and A. Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.
[28] P. J. Huber. Projection pursuit. The Annals of Statistics, pages 435–475, 1985.
[29] P. Jaini, K. A. Selby, and Y. Yu. Sum-of-squares polynomial flow. arXiv:1905.02325, 2019.
[30] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[31] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4743–4751. Curran Associates, Inc., 2016.
[32] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[33] H. Knothe. Contributions to the theory of convex bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957.
[34] I. Kobyzev, S. Prince, and M. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[35] Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 3115–3123, 2017.
[36] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2370–2378, 2016.
[37] A. Logg and G. N. Wells. DOLFIN: Automated finite element computing. ACM Transactions on Mathematical Software, 37(2), 2010.
[38] Y. Marzouk, T. Moselhy, M. Parno, and A. Spantini. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, and H. Owhadi, editors. Springer, 2016.
[39] J. Møller, A. R. Syversveen, and R. Waagepetersen. Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482, 1998.
[40] T. Moselhy and Y. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012.
[41] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[42] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
[43] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
[44] M. Parno and Y. M. Marzouk. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018.
[45] B. Peherstorfer and Y. Marzouk. A transport-based multifidelity preconditioner for Markov chain Monte Carlo. Advances in Computational Mathematics, 45(5-6):2321–2348, 2019.
[46] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.
[47] M. Rosenblatt. Remarks on a multivariate transformation. The Annals of Mathematical Statistics, pages 470–472, 1952.
[48] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71(2):319–392, 2009.
[49] C. O. Sakar, G. Serbes, A. Gunduz, H. C. Tunc, H. Nizam, B. E. Sakar, M. Tutuncu, T. Aydin, M. E. Isenkul, and H. Apaydin. A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing, 74:255–263, 2019.
[50] F. Santambrogio. Optimal Transport for Applied Mathematicians, volume 87. Springer, 2015.
[51] S. Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. Dokl. Akad. Nauk SSSR, 1963.
[52] A. Spantini, D. Bigoni, and Y. Marzouk. Inference via low-dimensional couplings. The Journal of Machine Learning Research, 19(1):2639–2709, 2018.
[53] A. M. Stuart. Inverse problems: a Bayesian perspective. Acta Numerica, 19:451–559, 2010.
[54] E. Tabak and C. V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
[55] V. N. Temlyakov. Greedy approximation. Acta Numerica, 17:235–409, 2008.
[56] L. N. Trefethen. Cubature, approximation, and isotropy in the hypercube. SIAM Review, 59(3):469–491, 2017.
[57] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[58] U. Wolff. Monte Carlo errors with less errors. Computer Physics Communications, 156(2):143–153, 2004.
[59] O. Zahm, T. Cui, K. Law, A. Spantini, and Y. Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.