# Copula-like Variational Inference

Marcel Hirt, Department of Statistical Science, University College of London, UK. marcel.hirt.16@ucl.ac.uk
Petros Dellaportas, Department of Statistical Science, University College of London, UK; Department of Statistics, Athens University of Economics and Business, Greece; and The Alan Turing Institute, UK.
Alain Durmus, CMLA, École normale supérieure Paris-Saclay, CNRS, Université Paris-Saclay, 94235 Cachan, France. alain.durmus@cmla.ens-cachan.fr

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This paper considers a new family of variational distributions motivated by Sklar's theorem. This family is based on new copula-like densities on the hypercube with non-uniform marginals which can be sampled efficiently, i.e. with a complexity linear in the dimension d of the state space. The proposed variational densities can then be seen as arising from these copula-like densities used as base distributions on the hypercube, combined with Gaussian quantile functions and sparse rotation matrices as normalizing flows. The latter correspond to a rotation of the marginals with complexity O(d log d). We provide some empirical evidence that such a variational family can also approximate non-Gaussian posteriors and can be beneficial compared to Gaussian approximations. Our method performs largely comparably to state-of-the-art variational approximations on standard regression and classification benchmarks for Bayesian Neural Networks.

1 Introduction

Variational inference [29, 68, 4] aims at performing Bayesian inference by approximating an intractable posterior density $\pi$ with respect to the Lebesgue measure on $\mathbb{R}^d$, based on a family of distributions which can easily be sampled from. More precisely, this kind of inference posits some variational family Q of densities $(q_\xi)_{\xi \in \Xi}$ with respect to the Lebesgue measure and intends to find a good approximation $q_{\xi^\star}$ belonging to Q by minimizing the Kullback-Leibler (KL) divergence with respect to $\pi$ over Q, i.e. $\xi^\star \in \arg\min_{\xi \in \Xi} \mathrm{KL}(q_\xi \,|\, \pi)$. Further, suppose that $\pi(x) = \mathrm{e}^{-U(x)}/Z$ with $U : \mathbb{R}^d \to \mathbb{R}$ measurable and $Z = \int_{\mathbb{R}^d} \mathrm{e}^{-U(x)}\,\mathrm{d}x < \infty$ an unknown normalising constant. Then, for any $\xi \in \Xi$,

$$\mathrm{KL}(q_\xi \,|\, \pi) = -\int_{\mathbb{R}^d} q_\xi(x) \log \frac{\pi(x)}{q_\xi(x)}\,\mathrm{d}x = \mathbb{E}_{q_\xi(x)}\left[ U(x) + \log q_\xi(x) \right] + \log Z \;. \tag{1}$$

Since Z does not depend on $q_\xi$, minimizing $\xi \mapsto \mathrm{KL}(q_\xi \,|\, \pi)$ is equivalent to maximizing $\xi \mapsto \log Z - \mathrm{KL}(q_\xi \,|\, \pi)$. A standard example is Bayesian inference over latent variables x having a prior density $\pi_0$, for a given likelihood function $L(y_{1:n}\,|\,x)$ and n observations $y_{1:n} = (y_1, \dots, y_n)$. The target density is the posterior $p(x\,|\,y_{1:n})$ with $U(x) = -\log \pi_0(x) - \log L(y_{1:n}\,|\,x)$, and the objective that is commonly maximized,

$$\mathcal{L}(\xi) = \mathbb{E}_{q_\xi(x)}\left[ \log \pi_0(x) + \log L(y_{1:n}\,|\,x) - \log q_\xi(x) \right] \;, \tag{2}$$

is called a variational lower bound or ELBO. One of the main features of variational inference methods is their ability to be scaled to large datasets using stochastic approximation methods [24] and applied to non-conjugate models by using Monte Carlo estimators of the gradient [57, 35, 60, 63, 38]. However, the approximation quality hinges on the expressiveness of the distributions in Q, and assumptions on the variational family that allow for efficient computations, such as mean-field families, tend to be too restrictive to recover the target distribution. Constructing an approximation family Q that is flexible enough to closely approximate the density of interest and at the same time computationally efficient has been an ongoing challenge.
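As a concrete reference point for how the bound (2) is typically handled in practice, the following is a minimal NumPy sketch (not the authors' implementation) of a Monte Carlo estimate of the ELBO for a toy one-dimensional target with a known normalising constant; the estimate is bounded above by log Z, with equality only when the variational density matches the target.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(sample_q, log_q, log_unnorm_target, num_samples=100_000):
    """Monte Carlo estimate of E_q[ log pi_tilde(x) - log q(x) ], i.e. log Z - KL(q | pi)."""
    x = sample_q(num_samples)                        # x_s ~ q_xi
    return np.mean(log_unnorm_target(x) - log_q(x))  # average over the samples

# Toy check: q = N(mu, s^2); unnormalised target pi_tilde = Z * N(0, 1) with Z = 2.
mu, s = 0.5, 1.2
sample_q = lambda n: mu + s * rng.standard_normal(n)
log_q = lambda x: -0.5 * ((x - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
log_unnorm_target = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi) + np.log(2.0)

print(elbo_estimate(sample_q, log_q, log_unnorm_target), np.log(2.0))
```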
Much effort has been dedicated to finding flexible and rich enough variational approximations, for instance by assuming a Gaussian approximation with different types of covariance matrices. For example, full-rank covariance matrices have been considered in [1, 28, 63] and low-rank perturbations of diagonal matrices in [1, 46, 53, 47]. Furthermore, covariance matrices with a Kronecker structure have been proposed in [42, 70]. Besides, more complex variational families have been suggested, such as mixture models [18, 22, 46, 40, 39] and implicit models [45, 26, 67, 69, 64], where the density of the variational distribution is intractable. Finally, variational inference based on normalizing flows has been developed in [59, 34, 65, 43, 3]. As a special case, and motivated by Sklar's theorem [62], variational inference based on families of copula densities and one-dimensional marginal distributions has been considered by [66], where it is assumed that the copula is a vine copula [2], and by [23], where the copula is assumed to be a Gaussian copula together with non-parametric marginals using Bernstein polynomials. Recall that $c : [0,1]^d \to \mathbb{R}_+$ is a copula density if and only if its marginals are uniform on [0, 1], i.e.

$$\int_{[0,1]^{d-1}} c(u_1, \dots, u_d)\, \mathrm{d}u_1 \cdots \mathrm{d}u_{i-1}\, \mathrm{d}u_{i+1} \cdots \mathrm{d}u_d = \mathbf{1}_{[0,1]}(u_i)$$

for any $i \in \{1, \dots, d\}$ and $u_i \in \mathbb{R}$. In the present work, we pursue these ideas but propose, instead of using a family of copula densities, simply a family of densities $\{c_\theta : [0,1]^d \to \mathbb{R}_+\}_{\theta \in \Theta}$ on the hypercube $[0,1]^d$. This idea is motivated by the fact that we are able to provide such a family which is both flexible and allows for efficient computations.

The paper is organised as follows. In Section 2, we recall how one can sample more expressive distributions and compute their densities using a sequence of bijective and continuously differentiable transformations. In particular, we illustrate how to apply this idea in order to sample from a target density by first sampling a random variable U from its copula density c and then applying the marginal quantile functions to each component of U. A new family of copula-like densities on the hypercube is constructed in Section 3 that allows for some flexibility in the dependence structure, while enjoying linear complexity in the dimension of the state space for generating samples and evaluating log-densities. A flexible variational distribution on $\mathbb{R}^d$ is introduced in Section 4 by sampling from such a copula-like density and then applying a sequence of transformations that include $\tfrac{1}{2} d \log d$ rotations over pairs of coordinates. We illustrate in Section 6 that for some target densities, arising for instance as the posterior in a logistic regression model, the proposed density allows for a better approximation, as measured by the KL divergence, compared to a Gaussian density. We conclude by applying the proposed methodology to Bayesian Neural Network models.

2 Variational Inference and Copulas

In order to obtain expressive variational distributions, the variational densities can be transformed through a sequence of invertible mappings, termed normalizing flows [60]. To be more specific, assume a series $\{T_t : \mathbb{R}^d \to \mathbb{R}^d\}_{t=1}^T$ of $C^1$-diffeomorphisms and a sample $X_0 \sim q_0$, where $q_0$ is a density function on $\mathbb{R}^d$. Then the random variable $X_T = T_T \circ T_{T-1} \circ \cdots \circ T_1(X_0)$ has a density $q_T$ that satisfies

$$\log q_T(x_T) = \log q_0(x_0) - \sum_{t=1}^{T} \log \left| \det \nabla T_t(x_{t-1}) \right| \;, \tag{3}$$

with $x_t = T_t \circ T_{t-1} \circ \cdots \circ T_1(x_0)$. To allow for scalable inference with such densities, the transformations $T_t$ must be chosen so that the determinants of their Jacobians can be computed efficiently.
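The bookkeeping behind (3) amounts to accumulating log-Jacobian terms along the chain of transformations. The sketch below (ours, not code from the paper) assumes each transformation returns its output together with its log-Jacobian determinant, and checks the formula on a simple affine map.

```python
import numpy as np

def flow_log_density(x0, log_q0, transforms):
    """Push x0 ~ q0 through T_1, ..., T_T and track log q_T(x_T) as in (3).

    Each element of `transforms` maps x to (T_t(x), log|det Jacobian of T_t at x|)."""
    x, log_q = x0, log_q0(x0)
    for T in transforms:
        x, log_det = T(x)
        log_q = log_q - log_det
    return x, log_q

# Example: a standard normal sample pushed through the affine map x -> 2x + 1.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
log_q0 = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
affine = lambda x: (2.0 * x + 1.0, np.log(2.0) * np.ones_like(x))

x_T, log_q_T = flow_log_density(x0, log_q0, [affine])
print(x_T, log_q_T)   # matches the log-density of N(1, 4) evaluated at x_T
```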
One possibility that satisfies this requirement is to choose volume-preserving flows that have a Jacobian determinant of one. This can be achieved by considering transformations $T_t : x \mapsto H_t x$, where $H_t$ is an orthogonal matrix, as proposed in [65] using a Householder reflection matrix $H_t$. An alternative construction of the same form can be obtained using Sklar's theorem [62, 48]. It establishes that given a target density $\pi$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, there exist a continuous function $C : [0,1]^d \to [0,1]$ and a probability space supporting a random variable $U = (U_1, \dots, U_d)$ valued in $[0,1]^d$, such that for any $x \in \mathbb{R}^d$ and $u \in [0,1]^d$,

$$\mathbb{P}(U_1 \le u_1, \dots, U_d \le u_d) = C(u_1, \dots, u_d) \;, \qquad \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_d} \pi(t)\, \mathrm{d}t = C(F_1(x_1), \dots, F_d(x_d)) \;, \tag{4}$$

where for any $i \in \{1, \dots, d\}$, $F_i$ is the cumulative distribution function associated with $\pi_i$, so for any $x_i \in \mathbb{R}$, $F_i(x_i) = \int_{-\infty}^{x_i} \pi_i(t_i)\, \mathrm{d}t_i$, and $\pi_i$ is the i-th marginal of $\pi$, so for any $x_i \in \mathbb{R}$, $\pi_i(x_i) = \int_{\mathbb{R}^{d-1}} \pi(x)\, \mathrm{d}x_1 \cdots \mathrm{d}x_{i-1}\, \mathrm{d}x_{i+1} \cdots \mathrm{d}x_d$. To illustrate how one can obtain such a continuous function C and random variable U, recall that each $\pi_i$ is assumed to be absolutely continuous with respect to the Lebesgue measure. Then for $(X_1, \dots, X_d) \sim \pi$, the random variable $U = G^{-1}(X) = (F_1(X_1), \dots, F_d(X_d))$, where $G : [0,1]^d \to \mathbb{R}^d$ with

$$G : u \mapsto (F_1^{-1}(u_1), \dots, F_d^{-1}(u_d)) \;, \tag{5}$$

follows a law on the hypercube with uniform marginals. It can be readily shown that the cumulative distribution function C of U is continuous and satisfies (4). Note that taking derivatives of (4) yields

$$\pi(x) = c(F_1(x_1), \dots, F_d(x_d)) \prod_{i=1}^{d} \pi_i(x_i) \;, \quad \text{where} \quad c(u_1, \dots, u_d) = \frac{\partial^d C(u_1, \dots, u_d)}{\partial u_1 \cdots \partial u_d}$$

is a copula density function by definition of C. One possibility to approximate a target density $\pi$ is then to consider a parametric family of copula density functions $(c_\theta)_{\theta \in \Theta}$ for $\Theta \subset \mathbb{R}^{p_c}$ and a parametric family of d-dimensional vectors of one-dimensional density functions $(f_1, \dots, f_d)_{\phi \in \Phi}$ for $\Phi \subset \mathbb{R}^{p_f}$, and to estimate $\theta \in \Theta$ and $\phi \in \Phi$ so as to obtain a good approximation of $\pi$ via variational Bayesian methods. This idea was proposed by [23] and [66], where Gaussian and vine copulas were used, respectively. The main hurdle for using such a family is its computational cost, which can be prohibitive since the dimension of $\Theta$ is of order $d^2$. We remark that for latent Gaussian models with certain likelihood functions, a Gaussian variational approximation can scale linearly in the number of observations by using dual variables, see [54, 31].

3 Copula-like Density

In this paper, we consider another approach which relies on a copula-like density function on $[0,1]^d$. Indeed, instead of an exact copula density function on $[0,1]^d$ with uniform marginals, we simply consider a density function on $[0,1]^d$, which allows for a certain degree of freedom in the number of parameters we want to use. The family of copula-like densities that we consider is given by

$$c_\theta(v_1, \dots, v_d) = \frac{\Gamma(\bar\alpha)}{\prod_{\ell=1}^d \Gamma(\alpha_\ell)} \Big( \prod_{\ell=1}^d v_\ell^{\alpha_\ell - 1} \Big) (\bar v)^{-\bar\alpha} \; \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \Big( \max_{i \in \{1,\dots,d\}} v_i \Big)^{a} \Big( 1 - \max_{i \in \{1,\dots,d\}} v_i \Big)^{b-1} \;, \tag{6}$$

with the notation $\bar v = \sum_{i=1}^d v_i$ and $\bar\alpha = \sum_{i=1}^d \alpha_i$. Therefore $\theta = (a, b, (\alpha_i)_{i \in \{1,\dots,d\}}) \in \mathbb{R}_+^* \times \mathbb{R}_+^* \times (\mathbb{R}_+^*)^d = \Theta$. The following probabilistic construction, proven in Appendix A, allows for efficient sampling from the proposed copula-like density.

Proposition 1. Let $\theta \in \Theta$ and suppose that
1. $(W_1, \dots, W_d) \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_d)$;
2. $G \sim \mathrm{Beta}(a, b)$;
3. $(V_1, \dots, V_d) = (G W_1 / W_\vee, \dots, G W_d / W_\vee)$, where $W_\vee = \max_{i \in \{1,\dots,d\}} W_i$.
Then the distribution of $(V_1, \dots, V_d)$ has density with respect to the Lebesgue measure given by (6).
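The construction in Proposition 1 is straightforward to simulate. The following NumPy/SciPy sketch (ours, not the authors' code) draws samples of V and evaluates the corresponding log-density; the log-density follows the display (6) as reconstructed above.

```python
import numpy as np
from scipy.special import gammaln

def sample_copula_like(alpha, a, b, n, rng):
    """Proposition 1: W ~ Dirichlet(alpha), G ~ Beta(a, b), V = G * W / max_i(W_i)."""
    w = rng.dirichlet(alpha, size=n)                 # rows lie on the simplex
    g = rng.beta(a, b, size=(n, 1))
    return g * w / w.max(axis=1, keepdims=True)      # max_i V_i = G, all components in (0, 1)

def log_density(v, alpha, a, b):
    """log c_theta(v), following (6) as reconstructed above."""
    abar, vbar, vmax = alpha.sum(), v.sum(axis=-1), v.max(axis=-1)
    log_dirichlet_part = (gammaln(abar) - gammaln(alpha).sum()
                          + ((alpha - 1.0) * np.log(v)).sum(axis=-1) - abar * np.log(vbar))
    log_beta_part = (gammaln(a + b) - gammaln(a) - gammaln(b)
                     + a * np.log(vmax) + (b - 1.0) * np.log1p(-vmax))
    return log_dirichlet_part + log_beta_part

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.5, 3.0])
v = sample_copula_like(alpha, a=2.0, b=2.0, n=4, rng=rng)
print(v)
print(log_density(v, alpha, a=2.0, b=2.0))
```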
The proposed distribution builds on Beta distributions: the marginals of the Dirichlet-distributed random variable $W \sim \mathrm{Dir}(\alpha)$ are Beta, and W is multiplied by an independent random variable $G \sim \mathrm{Beta}(a, b)$. The resulting random variable $Y = WG$ follows a Beta-Liouville distribution, which can account for negative dependence, inherited from the Dirichlet distribution through a Beta stick-breaking construction, as well as positive dependence via a common Beta factor. More precisely, one obtains

$$\mathrm{Cor}(Y_i, Y_j) = c_{ij} \left( \frac{\bar\alpha}{\bar\alpha + 1} \mathbb{E}[G^2] - \mathbb{E}[G]^2 \right)$$

for some $c_{ij} > 0$ and $\bar\alpha = \sum_{k=1}^d \alpha_k$, cf. [13]. Proposition 1 shows that one can transform the Beta-Liouville distribution, which lives within the simplex, to a distribution supported on the full hypercube, while still allowing for efficient sampling and log-density evaluations.

Now note that $V' = (1 - V_1, \dots, 1 - V_d)$ is also a sample on the hypercube if $V \sim c_\theta$, as is the convex combination $U = (U_1, \dots, U_d)$ with $U_i = \delta_i V_i + (1 - \delta_i)(1 - V_i)$ for any $\delta \in [0,1]^d$. Put differently, we can write $U = H_\delta(V)$, where

$$H_\delta : v \mapsto (1 - \delta) + \{\mathrm{diag}(2\delta) - \mathrm{I}_d\} v \;, \tag{7}$$

$1 - \delta$ denotes the vector with components $1 - \delta_i$, $\mathrm{diag}(2\delta)$ is the diagonal matrix with entries $2\delta_i$ and $\mathrm{I}_d$ is the identity operator. It is straightforward to see that $H_\delta$ is a $C^1$-diffeomorphism for $\delta \in ([0,1] \setminus \{0.5\})^d$ from the hypercube into $\mathrm{I}_1 \times \cdots \times \mathrm{I}_d$, where $\mathrm{I}_i = [\delta_i, 1 - \delta_i]$ if $\delta_i \in [0, 0.5)$ and $\mathrm{I}_i = [1 - \delta_i, \delta_i]$ if $\delta_i \in (0.5, 1]$. Note that the Jacobian determinant of $H_\delta$ is efficiently computable and is simply equal to $|\prod_{i=1}^d (2\delta_i - 1)|$ for $\delta \in [0,1]^d$. We suggest to draw $\delta \in [0,1]^d$ at random initially for the transformation $H_\delta$, such that

$$\mathbb{P}(\delta_i = \epsilon) = p \quad \text{and} \quad \mathbb{P}(\delta_i = 1 - \epsilon) = 1 - p \;, \tag{8}$$

with $p, \epsilon \in (0, 1)$. In our experiments, we set $\epsilon = 0.01$ and $p = 1/2$. We found that choosing a different (large enough) value of $\epsilon$ tends to make little difference, as this choice gets balanced by a different value of the standard deviation of the Gaussian marginal transformation. The motivation to consider $U = H_\delta(V)$ with $V \sim c_\theta$ was, first, numerical stability, since with this transformation we only need to compute quantile functions on the interval $[\epsilon, 1 - \epsilon]$. Second, this transformation can increase the flexibility of our proposed family. We found empirically that the components of $V \sim c_\theta$ tend to be non-negatively correlated in higher dimensions. However, by sometimes using (more of) the antithetic component of V through $U = H_\delta(V)$, the transformed density can also describe negative dependencies in high dimensions. A natural way to obtain a flexible density would then be either to optimize over the parameter $\delta$ parametrising the transformation $H_\delta$, or to treat $\delta$ as an auxiliary variable in the variational density, resorting to techniques developed for such hierarchical families, see for instance [58, 69, 64]. However, this proved challenging in an initial attempt, since for $\delta_i = 0.5$ the transformation $H_\delta$ becomes non-invertible, while restricting $\delta$ to, say, $\{\epsilon, 1 - \epsilon\}^d$ with $\epsilon \approx 0$ seemed harder to optimize. Consequently, we keep $\delta$ fixed after sampling it initially according to (8). The choice $p = 1/2$ is sensible since it leads to a balanced proportion of components of $\delta$ equal to $\epsilon$ and $1 - \epsilon$. However, the sampled value of $\delta$ might not be optimal, and we illustrate in the next section how the variational density can be made more flexible.
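The transformation $H_\delta$ and the initialisation (8) of $\delta$ are simple to implement; the sketch below (ours, not the paper's code) also returns the log-Jacobian term $\log|\prod_i (2\delta_i - 1)|$ needed in the flow formula (3).

```python
import numpy as np

def sample_delta(d, eps=0.01, p=0.5, rng=None):
    """Draw delta_i in {eps, 1 - eps} with P(delta_i = eps) = p, as in (8)."""
    rng = np.random.default_rng() if rng is None else rng
    return np.where(rng.random(d) < p, eps, 1.0 - eps)

def H(v, delta):
    """U_i = delta_i V_i + (1 - delta_i)(1 - V_i), with log-Jacobian sum_i log|2 delta_i - 1|."""
    u = delta * v + (1.0 - delta) * (1.0 - v)
    log_det = np.log(np.abs(2.0 * delta - 1.0)).sum()
    return u, log_det

rng = np.random.default_rng(1)
delta = sample_delta(5, rng=rng)
v = rng.random(5)                 # stand-in for a draw from c_theta
u, log_det = H(v, delta)
print(delta, u, log_det)
```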
4 Rotated Variational Density

We propose to apply rotations to the marginals in order to improve on the initial orientation that results from the sampled value of $\delta$. Rotated copulas have been used before in low dimensions, see for instance [36]; however, the set of orthogonal matrices has $d(d-1)/2$ free parameters. We reduce the number of free parameters by considering only rotation matrices $R_d$ that are given as a product of $\tfrac{d}{2} \log d$ Givens rotations, following the FFT-style butterfly architecture proposed in [16]; see also [44] and [49], where such an architecture was used for approximating Hessians and kernel functions, respectively. Recall that a Givens rotation matrix [21] is a sparse matrix with one angle as its parameter that rotates two coordinates by this angle. If we assume for the moment that $d = 2^k$, $k \in \mathbb{N}$, then we consider k rotation matrices denoted $O_1, \dots, O_k$, where for any $i \in \{1, \dots, k\}$, $O_i$ contains $d/2$ independent rotations, i.e. is the product of $d/2$ independent Givens rotations. The Givens rotations are arranged in a butterfly architecture that provides a minimal number of rotations such that all coordinates can interact with one another in the rotation defined by $R_d$. For illustration, consider the case $d = 4$, where the rotation matrix is fully described using $4 - 1 = 3$ parameters $\nu_1, \nu_2, \nu_3 \in \mathbb{R}$ by $R_4 = O_1 O_2$ with

$$O_1 = \begin{pmatrix} c_1 & -s_1 & 0 & 0 \\ s_1 & c_1 & 0 & 0 \\ 0 & 0 & c_3 & -s_3 \\ 0 & 0 & s_3 & c_3 \end{pmatrix}, \quad O_2 = \begin{pmatrix} c_2 & 0 & -s_2 & 0 \\ 0 & c_2 & 0 & -s_2 \\ s_2 & 0 & c_2 & 0 \\ 0 & s_2 & 0 & c_2 \end{pmatrix}, \quad R_4 = \begin{pmatrix} c_1 c_2 & -s_1 c_2 & -c_1 s_2 & s_1 s_2 \\ s_1 c_2 & c_1 c_2 & -s_1 s_2 & -c_1 s_2 \\ c_3 s_2 & -s_3 s_2 & c_3 c_2 & -s_3 c_2 \\ s_3 s_2 & c_3 s_2 & s_3 c_2 & c_3 c_2 \end{pmatrix},$$

where $c_i = \cos(\nu_i)$ and $s_i = \sin(\nu_i)$. We provide a precise recursive definition of $R_d$ in Appendix B, where we also describe the case in which d is not a power of two. In general, we have a computational complexity of $O(d \log d)$, due to the fact that $R_d$ is a product of $O(\log d)$ matrices, each requiring $O(d)$ operations. Moreover, note that $R_d$ is parametrized by $d - 1$ parameters $(\nu_i)_{i \in \{1, \dots, d-1\}}$ and each $O_i$ can be implemented as a sparse matrix, which implies a memory complexity of $O(d)$. Furthermore, since $O_i$ is orthonormal, we have $O_i^{-1} = O_i^\top$ and $|\det O_i| = 1$.

To construct an expressive variational distribution, we consider as base distribution $q_0$ the proposed copula-like density $c_\theta$. We then apply the transformations $T_1 = H_\delta$ and $T_2 = G_\phi$. The operator $G_\phi$ takes the form of (5), with the quantile functions $F_i^{-1}$ of densities $f_1, \dots, f_d$, for which we choose Gaussian densities with parameters $\phi_f = (\mu_1, \dots, \mu_d, \sigma_1^2, \dots, \sigma_d^2) \in \mathbb{R}^d \times \mathbb{R}_+^d$. As a final transformation, we apply the volume-preserving operator

$$T_3 : x \mapsto O_1 \cdots O_{\log d}\, x \tag{9}$$

that has parameters $\phi_R = (\nu_1, \dots, \nu_{d-1}) \in \mathbb{R}^{d-1}$. Altogether, the parameter for the marginal-like densities that we optimize over is $\phi = (\phi_f, \phi_R)$, and simulation from the variational density boils down to the following algorithm.

Algorithm 1 Sampling from the rotated copula-like density.
1: Sample $(V_1, \dots, V_d) \sim c_\theta$ using Proposition 1.
2: Set $U = H_\delta(V)$, where $H_\delta$ is defined in (7).
3: Set $X = G_\phi(U)$, where $G_\phi$ is of the form (5).
4: Set $X \leftarrow T_3(X)$, where $T_3$ is defined in (9).

Note that we apply the rotations after we have transformed samples from the hypercube into $\mathbb{R}^d$, as the hypercube is not closed under Givens rotations. The variational density can then be evaluated using the normalizing flow formula (3). We optimize the variational lower bound $\mathcal{L}$ in (2) using reparametrization gradients, proposed by [35, 60, 63], but with implicit reparametrization, cf. [14], for the Dirichlet and Beta distributions. Such reparametrized gradients for Dirichlet and Beta distributions are readily available, for instance, in TensorFlow Probability [9]. Using Monte Carlo samples to form unbiased gradient estimates, one can optimize the variational bound using some version of stochastic gradient descent. A more formal description is given in Appendix C; a sketch of how the butterfly rotation can be applied is given below.
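The butterfly arrangement of Givens rotations can be applied coordinate pair by coordinate pair in O(d log d) without forming any matrix. The sketch below (ours) illustrates the architecture for d a power of two; for simplicity it uses one free angle per Givens rotation rather than the tied (d − 1)-parameter scheme detailed in Appendix B, so it should be read as an illustration of the butterfly pattern rather than the exact map $T_3$.

```python
import numpy as np

def butterfly_pairs(d):
    """FFT-style pairing: stage s couples index j with j + 2**(s-1) inside blocks of size 2**s."""
    k = int(np.log2(d))
    assert 2 ** k == d, "this sketch assumes d is a power of two"
    stages = []
    for s in range(1, k + 1):
        half = 2 ** (s - 1)
        stages.append([(b + j, b + j + half) for b in range(0, d, 2 * half) for j in range(half)])
    return stages

def apply_butterfly(x, angles_per_stage):
    """Apply the log2(d) stages of d/2 Givens rotations to x; each rotation touches one pair."""
    x = x.copy()
    for pairs, angles in zip(butterfly_pairs(len(x)), angles_per_stage):
        for (i, j), nu in zip(pairs, angles):
            c, s = np.cos(nu), np.sin(nu)
            x[i], x[j] = c * x[i] - s * x[j], s * x[i] + c * x[j]
    return x

rng = np.random.default_rng(0)
d = 8
angles = [rng.uniform(0.0, 2.0 * np.pi, size=d // 2) for _ in range(int(np.log2(d)))]
x = rng.standard_normal(d)
y = apply_butterfly(x, angles)
print(np.linalg.norm(x), np.linalg.norm(y))   # equal norms: the composed map is orthogonal
```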
We would like to remark that such sparse rotations can similarly be applied to proper copulas. While there is no additional flexibility in rotating a full-rank Gaussian copula, applying such rotations to a Gaussian copula with a low-rank correlation matrix yields a Gaussian distribution with a more flexible covariance structure if combined with Gaussian marginals. In our experiments, we therefore also compare against variational families constructed by sampling $(V_1, \dots, V_d)$ from an independence copula in step 1 of Algorithm 1, i.e. with the $V_i$ independent and uniformly distributed on [0, 1] for any $i \in \{1, \dots, d\}$, which results approximately in a Gaussian variational distribution if the effect of the transformation $H_\delta$ is neglected. However, a more thorough analysis of such families is left for future work. Similarly, transformations different from the sparse rotations in step 4 of Algorithm 1 can be used in combination with a copula-like base density. Whilst we include a comparison with a simple Inverse Autoregressive Flow [34] in our experiments, a more exhaustive study of non-linear transformations is beyond the scope of this work.

5 Related Work

Conceptually, our work is closely related to [66, 23]. It differs from [66] in that it can be applied in high dimensions without having to search first for the most correlated variables using, for instance, a sequential tree selection algorithm [11]. The approach in [23] considered a Gaussian dependence structure, but has only been applied in low-dimensional settings. On the more computational side, our approach is related to variational inference with normalizing flows [59, 34, 65, 43, 3]. In contrast to these works, which commonly introduce a parameter-free base distribution on $\mathbb{R}^d$ as the latent state space, we also optimize over the parameters of the base distribution, which is supported on the hypercube instead; distributions supported, for instance, on the hypersphere have been considered in [7]. Moreover, such approaches have often been used in the context of generative models using Variational Auto-Encoders (VAEs) [35], yet it is in principle possible to apply the proposed variational copula-like inference in an amortized fashion for VAEs. A somewhat similar copula-like construction in the context of importance sampling has been proposed in [8]. However, sampling from that density requires a rejection step to ensure support on the hypercube, which would make optimization of the variational bound less straightforward. Lastly, [30] proposed a method to approximate copulas using mixture distributions, but these approximations have been analysed neither in high dimensions nor in the context of variational inference.

6 Experiments

6.1 Bayesian Logistic Regression

Consider the target distribution $\pi$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ arising as the posterior of a d-dimensional logistic regression, assuming a Normal prior $\pi_0 = \mathrm{N}(0, \tau^{-1}\mathrm{I})$, $\tau = 0.01$, and likelihood function $L(y_i\,|\,x) = f(y_i x^\top a_i)$, $f(z) = 1/(1 + \mathrm{e}^{-z})$, with n observations $y_i \in \{-1, 1\}$ and fixed covariates $a_i \in \mathbb{R}^d$ for $i \in \{1, \dots, n\}$. We analyse a previously considered synthetic dataset where the posterior distribution is non-Gaussian, yet can be well approximated with our copula-like construction. Concretely, we consider the synthetic dataset with d = 2 as in [50], Section 8.4, and [32], generating 30 covariates $a \in \mathbb{R}^2$ from a Gaussian $\mathrm{N}((1, 5)^\top, \mathrm{I})$ for instances in the first class and 30 covariates from $\mathrm{N}((-5, 1)^\top, 1.1^2\, \mathrm{I})$ for instances in the second class.
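For concreteness, the negative unnormalised log posterior U(x) of this model can be written in a few lines. The sketch below (ours, with synthetic data generated as described above and a hypothetical seed) is only meant to make the target explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-d covariates, 30 per class, with labels y in {-1, +1}.
a = np.vstack([rng.normal([1.0, 5.0], 1.0, size=(30, 2)),
               rng.normal([-5.0, 1.0], 1.1, size=(30, 2))])
y = np.concatenate([np.ones(30), -np.ones(30)])

def U(x, tau=0.01):
    """U(x) = -log pi_0(x) - log L(y_{1:n} | x), up to an additive constant."""
    log_prior = -0.5 * tau * np.dot(x, x)                 # N(0, tau^{-1} I) prior, unnormalised
    log_lik = -np.sum(np.log1p(np.exp(-y * (a @ x))))     # sum_i log sigmoid(y_i a_i^T x)
    return -(log_prior + log_lik)

print(U(np.zeros(2)))   # equals 60 * log 2 at the zero vector
```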
Samples from the target distribution obtained with a Hamiltonian Monte Carlo (HMC) sampler [12, 51] are shown in Figure 1a, and one observes non-Gaussian marginals that are positively correlated with heavy right tails. A Gaussian variational approximation, with either independent marginals or a full covariance matrix as shown in Figure 1b, does not adequately approximate the target distribution. Our copula-like construction is able to approximate the target more closely, both without any rotations (Figure 1c) and with a rotation of the marginals (Figure 1d). This is also supported by the ELBO obtained for the different variational families, given in Table 1.

Table 1: Comparison of the ELBO between different variational families for the logistic regression experiment.

| Variational family | ELBO |
| --- | --- |
| Mean-field Gaussian | -3.42 |
| Full-covariance Gaussian | -2.97 |
| Copula-like without rotations | -2.30 |
| Copula-like with rotations | -2.19 |

Figure 1: Target density for logistic regression using a HMC sampler in 1a with different variational approximations: Gaussian variational approximation with a full covariance matrix in 1b, copula-like variational approximation without any rotation in 1c and copula-like variational approximation with a rotation in 1d.

6.2 Centred Horseshoe Priors

We illustrate our approach on a hierarchical Bayesian model that posits a priori a strong coupling of the latent parameters. As an example, we consider a Horseshoe prior [6] that has also been considered in the variational Gaussian copula framework of [23]. To be more specific, consider the generative model $y\,|\,\lambda \sim \mathrm{N}(0, \lambda)$ with $\lambda \sim \mathrm{C}^+(0, 1)$, where $\mathrm{C}^+$ denotes a half-Cauchy distribution, i.e. $X \sim \mathrm{C}^+(0, b)$ has the density $p(x) \propto \mathbf{1}_{\mathbb{R}_+}(x)/(x^2 + b^2)$. Note that we can represent a half-Cauchy distribution with Inverse Gamma and Gamma distributions using

$$X \sim \mathrm{C}^+(0, b) \iff X^2\,|\,Y \sim \mathrm{IG}(1/2, 1/Y)\,; \quad Y \sim \mathrm{IG}(1/2, 1/b^2) \;,$$

see [52], with the rate parametrisation of the inverse Gamma density $p(x) \propto \mathbf{1}_{\mathbb{R}_+}(x)\, x^{-a-1} \mathrm{e}^{-b/x}$ for $X \sim \mathrm{IG}(a, b)$. We revisit the toy model in [23], fixing $y = 0.01$. The model thus writes in a centred form as $\eta \sim \mathrm{G}(1/2, 1)$ and $\lambda\,|\,\eta \sim \mathrm{IG}(1/2, \eta)$. Following [23], we consider the posterior density on $\mathbb{R}^2$ of the log-transformed variables $(x_1, x_2) = (\log \eta, \log \lambda)$. In Figure 2, we show the approximate posterior distribution using a Gaussian family (2b) and a copula-like family (2c), together with samples from a HMC sampler (2a). A copula-like density yields a higher ELBO, see Table 2. The experiments in [23] have shown that a Gaussian copula with a non-parametric mixture model fits the marginals more closely. To illustrate that it is possible to arrive at a more flexible variational family by using a mixture of copula-like densities, we have used a mixture of 3 copula-like densities in Figure 2d. Note that it is possible to accommodate multi-modal marginals using a Gaussian quantile transformation with a copula-like density. Eventually, the flexibility of the variational approximation can be increased using complementary work: for instance, one could use the new density within a semi-implicit variational framework [69], whose parameters are the output of a neural network conditioned on some latent mixing variable.
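As an aside, the inverse-Gamma mixture representation of the half-Cauchy distribution quoted above can be checked numerically. The snippet below (ours, not from the paper) draws X both through the mixture and directly as b·|A/B| with A, B standard normal, and compares the sample medians to the true half-Cauchy median b.

```python
import numpy as np

rng = np.random.default_rng(0)
b, n = 2.0, 200_000

# Mixture representation: Y ~ IG(1/2, 1/b^2), X^2 | Y ~ IG(1/2, 1/Y),
# using IG(a, beta) = beta / Gamma(a, rate=1) under the rate parametrisation.
y = (1.0 / b ** 2) / rng.gamma(0.5, 1.0, size=n)
x_mix = np.sqrt((1.0 / y) / rng.gamma(0.5, 1.0, size=n))

# Direct half-Cauchy(0, b) draws.
x_direct = b * np.abs(rng.standard_normal(n) / rng.standard_normal(n))

print(np.median(x_mix), np.median(x_direct), b)   # all close to b
```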
Table 2: Comparison of the ELBO between different variational families for the centred horseshoe model.

| Variational family | ELBO |
| --- | --- |
| Mean-field Gaussian | -1.24 |
| Full-covariance Gaussian | -0.04 |
| Copula-like | 0.04 |
| 3-mixture copula-like | 0.08 |

Figure 2: Target density for the horseshoe model using a HMC sampler in 2a with different variational approximations: Gaussian variational approximation with a full covariance matrix in 2b, copula-like variational approximation including a rotation in 2c and a mixture of three copula-like densities with one rotation and marginal-like density in 2d.

6.3 Bayesian Neural Networks with Normal Priors

We consider an L-hidden-layer fully-connected neural network where each layer l, $1 \le l \le L+1$, has width $d_l$ and is parametrised by a weight matrix $W^l \in \mathbb{R}^{d_{l-1} \times d_l}$ and bias vector $b^l \in \mathbb{R}^{d_l}$. Let $h^1 \in \mathbb{R}^{d_0}$ denote the input to the network and f be a point-wise non-linearity such as the ReLU function $f(a) = \max\{0, a\}$; define the activations $a^l \in \mathbb{R}^{d_l}$ by $a^{l+1} = \sum_i h_i^l W_i^l + b^l$ for $l \ge 1$, and the post-activations as $h^l = f(a^l)$ for $l \ge 2$. We consider a regression likelihood function $L(\cdot\,|\,a^{L+2}, \sigma) = \mathrm{N}(a^{L+2}, \exp(0.5\sigma))$, and denote the concatenation of all parameters W, b and $\sigma$ by x. We assume independent Normal priors for the entries of the weight matrices and bias vectors with mean 0 and variance $\sigma_0^2$. Furthermore, we assume that $\log \sigma \sim \mathrm{N}(0, 16)$. Inference with the proposed variational family is applied to commonly considered UCI regression datasets, repeating the experimental set-up used in [15]. In particular, we use neural networks with ReLU activation functions and one hidden layer of size 50 for all datasets, with the exception of the protein dataset, which utilizes a hidden layer of size 100. We choose the hyper-parameter $\sigma_0^2 \in \{0.01, 0.1, 1, 10, 100\}$ that performed best on a validation dataset in terms of its predictive log-likelihood. Optimization was performed using Adam [33] with a learning rate of 0.002. We compare the predictive performance of a copula-like density $c_\theta$ and an independence copula as the base distribution in step 1 of Algorithm 1, and we apply different transformations $T_3$ in step 4 of Algorithm 1: a) the proposed sparse rotation defined in (9); b) $T_3 = \mathrm{Id}$; c) an affine autoregressive transformation $T_3(x) = \{x - f_\mu(x)\} \odot \exp(-f_\alpha(x))$, see [34], also known as an inverse autoregressive flow (IAF). Here $f_\mu$ and $f_\alpha$ are autoregressive neural networks parametrized by $\mu$ and $\alpha$, satisfying $\partial f_\mu(x)_i / \partial x_j = \partial f_\alpha(x)_i / \partial x_j = 0$ for $i \le j$, which can be computed in a single forward pass by properly masking the weights of the neural networks [17]. In our experiments, we use a one-hidden-layer fully-connected network with width $d_1^{\mathrm{IAF}} = 50$ for $f_\mu$ and $f_\alpha$. Note that for a d-dimensional target density, the sizes of the weight matrices are of order $d \times d_1^{\mathrm{IAF}}$, implying a higher complexity compared to the sparse rotation. We also compare the predictions against Bayes-by-Backprop [5] using a mean-field model, with the results as reported in [47] for mean-field Bayes-by-Backprop and the low-rank Gaussian approximation proposed therein, called SLANG. Furthermore, we also report the results for Dropout inference [15].

Table 3: Variational approximations with transformations and different base distributions. Test root mean-squared error for UCI regression datasets. Standard errors in parentheses.
| | Copula-like with rotation | Independent copula with rotation | Copula-like with IAF | Independent copula with IAF |
| --- | --- | --- | --- | --- |
| Boston | 3.43 (0.22) | 3.51 (0.30) | 3.21 (0.27) | 3.61 (0.28) |
| Concrete | 5.76 (0.14) | 6.00 (0.13) | 5.41 (0.10) | 5.82 (0.11) |
| Energy | 0.55 (0.01) | 2.28 (0.11) | 0.53 (0.02) | 1.30 (0.10) |
| Kin8nm | 0.08 (0.00) | 0.08 (0.00) | 0.08 (0.00) | 0.08 (0.00) |
| Naval | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| Power | 4.02 (0.04) | 4.19 (0.04) | 4.05 (0.04) | 4.15 (0.04) |
| Wine | 0.64 (0.01) | 0.64 (0.01) | 0.64 (0.01) | 0.64 (0.01) |
| Yacht | 1.35 (0.08) | 1.38 (0.12) | 0.96 (0.06) | 1.25 (0.09) |
| Protein | 4.20 (0.01) | 4.51 (0.04) | 4.31 (0.01) | 4.51 (0.03) |

Table 4: Copula-like variational approximation without rotations and benchmark results. Test root mean-squared error for UCI regression datasets. Standard errors in parentheses.

| | Copula-like without rotation | Bayes-by-Backprop (results from [47]) | SLANG (results from [47]) | Dropout (results from [47]) |
| --- | --- | --- | --- | --- |
| Boston | 3.22 (0.25) | 3.43 (0.20) | 3.21 (0.19) | 2.97 (0.19) |
| Concrete | 5.64 (0.14) | 6.16 (0.13) | 5.58 (0.12) | 5.23 (0.12) |
| Energy | 0.52 (0.02) | 0.97 (0.09) | 0.64 (0.04) | 1.66 (0.04) |
| Kin8nm | 0.08 (0.00) | 0.08 (0.00) | 0.08 (0.00) | 0.10 (0.01) |
| Naval | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.01 (0.01) |
| Power | 4.05 (0.04) | 4.21 (0.03) | 4.16 (0.04) | 4.02 (0.04) |
| Wine | 0.65 (0.01) | 0.64 (0.01) | 0.65 (0.01) | 0.62 (0.01) |
| Yacht | 1.23 (0.08) | 1.13 (0.06) | 1.08 (0.09) | 1.11 (0.09) |
| Protein | 4.31 (0.02) | NA | NA | 4.27 (0.01) |

The test root mean-squared errors are given in Table 3 and Table 4; the predictive test log-likelihoods can be found in Appendix E, in Table 6 and Table 7. We can observe from Table 3 and Table 6 that using a copula-like base distribution instead of an independence copula improves the predictive performance, using either rotations or the IAF as the final transformation. The same tables also indicate that, for a given base distribution, the IAF tends to slightly outperform the sparse rotations. Table 4 and Table 7 suggest that copula-like densities without any transformation in the last step can be a competitive alternative to a benchmark mean-field or Gaussian approximation. Dropout tends to perform slightly better. However, note that Dropout with a Normal prior and a variational mixture distribution that includes a Dirac delta function as one component gives rise to a different objective, since the prior is not absolutely continuous with respect to the approximate posterior, see [25].

6.4 Bayesian Neural Networks with Structured Priors

We illustrate our approach on a larger Bayesian neural network. To induce sparsity of the weights in the network, we consider a (regularised) Horseshoe prior [56], which has also been used increasingly as an alternative prior in Bayesian neural networks to allow for sparse variational approximations, see [41, 19] for mean-field models and [20] for a structured Gaussian approximation. We consider again an L-hidden-layer fully-connected neural network, where we assume that the weight matrix $W^l \in \mathbb{R}^{d_{l-1} \times d_l}$ for any $l \in \{1, \dots, L+1\}$ and any $i \in \{1, \dots, d_{l-1}\}$ satisfies a priori

$$W_i^l \,|\, \lambda_i^l, \tau^l, c \;\sim\; \mathrm{N}\big(0, (\tau^l \tilde\lambda_i^l)^2\, \mathrm{I}\big) \;\propto\; \mathrm{N}\big(0, (\tau^l \lambda_i^l)^2\, \mathrm{I}\big)\, \mathrm{N}\big(0, c^2\, \mathrm{I}\big) \;, \tag{10}$$

where $(\tilde\lambda_i^l)^2 = c^2 (\lambda_i^l)^2 / \big(c^2 + (\tau^l)^2 (\lambda_i^l)^2\big)$, $\lambda_i^l \sim \mathrm{C}^+(0, 1)$, $\tau^l \sim \mathrm{C}^+(0, b_\tau)$ and $c^2 \sim \mathrm{IG}(\nu/2, \nu s^2/2)$ for some hyper-parameters $b_\tau, \nu, s^2 > 0$. The vector $W_i^l$ represents all weights that interact with the i-th input neuron. The first Normal factor in (10) is a standard Horseshoe prior with a per-layer global parameter $\tau^l$ that adapts to the overall sparsity in layer l and shrinks all weights in this layer to zero, due to the fact that $\mathrm{C}^+(0, b_\tau)$ places substantial mass near zero.

Table 5: MNIST prediction errors.
| Variational approximation with Horseshoe prior and size 200-200 | Error rate |
| --- | --- |
| Copula-like with rotations | 1.70 % |
| Copula-like without rotations | 1.78 % |
| Copula-like with IAF | 2.04 % |
| Independent copula with IAF | 2.88 % |
| Independent copula with rotations | 2.90 % |
| Mean-field Gaussian | 3.82 % |
| Copula-like without rotations and $\delta_i = 0.99$ for all $i \in \{1, \dots, d\}$ | 5.70 % |

The local shrinkage parameter $\lambda_i^l$ allows for signals in the i-th input neuron because $\mathrm{C}^+(0, 1)$ is heavy-tailed. However, this can leave large weights un-shrunk, and the second Normal factor in (10) induces a Student-$t_\nu(0, s^2)$ regularisation for weights far from zero, see [56] for details. We can rewrite the model in a non-centred form [55], where the latent parameters are a priori independent, see also [41, 27, 19, 20] for similar variational approximations. We write the model as $\eta_i^l \sim \mathrm{G}(1/2, 1)$, $\hat\lambda_i^l \sim \mathrm{IG}(1/2, 1)$, $\kappa^l \sim \mathrm{G}(1/2, 1/b_\tau^2)$, $\hat\tau^l \sim \mathrm{IG}(1/2, 1)$, $\beta_i^l \sim \mathrm{N}(0, \mathrm{I})$, with

$$W_i^l = \tau^l \tilde\lambda_i^l \beta_i^l \;, \quad \tau^l = \big(\hat\tau^l \kappa^l\big)^{1/2} \;, \quad \lambda_i^l = \big(\hat\lambda_i^l \eta_i^l\big)^{1/2} \quad \text{and} \quad (\tilde\lambda_i^l)^2 = \frac{c^2 (\lambda_i^l)^2}{c^2 + (\tau^l)^2 (\lambda_i^l)^2} \;.$$

The target density is the posterior of these variables, after applying a log-transformation to those whose prior is an (inverse) Gamma law. We performed classification on MNIST using a 2-hidden-layer fully-connected network where the hidden layers are of size 200 each. Further algorithmic details are given in Appendix D. Prediction errors for the variational families considered in the preceding experiments are given in Table 5. We again find that a copula-like density outperforms the independence copula. Using a copula-like density without the rotation also performs competitively, as long as one uses a balanced amount of its antithetic component via the transformation $H_\delta$ with parameter $\delta$; ignoring the transformation $H_\delta$, or setting $\delta_i = 0.99$ for all $i \in \{1, \dots, d\}$, can limit the representative power of the variational family and can result in high predictive errors. The neural network used for the IAF considered here has two hidden layers of size 100 each. It can be seen that applying the rotations can be beneficial compared to the IAF for the copula-like density, whereas the two transformations perform similarly for the independent base distribution. We expect that further ad-hoc tricks can be used to adjust the rotations to a given computational budget. For instance, one could include additional rotations for a group of latent variables, such as those within one layer. Conversely, one could consider the series of sparse rotations $O_1, \dots, O_k$ but with $2^k < d$, thereby allowing for rotations of the more adjacent latent variables only. Our experiment illustrates that the proposed approach can be used in high-dimensional structured Bayesian models without having to specify more model-specific dependency assumptions in the variational family. The prediction errors are in line with current work for fully-connected networks using a Gaussian variational family with Normal priors, cf. [47]. Better predictive performance for a fully-connected Bayesian network has been reported in [37], which uses Real NVP [10] as a normalizing flow in a large network that is reparametrised using weight normalization [61]. It becomes scalable by opting to perform variational inference only over the Euclidean norm of $W_i^l$ and point estimation for the direction of the weight vector $W_i^l / \|W_i^l\|_2$. Such a parametrisation does not allow for a flexible dependence structure of the weights within one layer, and such a model architecture should be complementary to the proposed variational family in this work.
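To make the non-centred parametrisation concrete, the sketch below (ours) maps draws of the non-centred variables to one layer's weight matrix. The square roots in $\tau^l = (\hat\tau^l \kappa^l)^{1/2}$ and $\lambda_i^l = (\hat\lambda_i^l \eta_i^l)^{1/2}$ follow the reconstruction given above and should be read as an assumption rather than the authors' exact parametrisation.

```python
import numpy as np

rng = np.random.default_rng(0)

def weights_from_noncentred(eta, lam_hat, kappa, tau_hat, beta, c):
    """tau = sqrt(tau_hat * kappa), lambda_i = sqrt(lam_hat_i * eta_i),
    lambda_tilde_i^2 = c^2 lambda_i^2 / (c^2 + tau^2 lambda_i^2), W_i = tau * lambda_tilde_i * beta_i."""
    tau = np.sqrt(tau_hat * kappa)
    lam = np.sqrt(lam_hat * eta)
    lam_tilde = np.sqrt(c ** 2 * lam ** 2 / (c ** 2 + tau ** 2 * lam ** 2))
    return tau * lam_tilde[:, None] * beta            # row i holds the weights of input neuron i

d_in, d_out, b_tau, c = 4, 3, 1.0, 2.0
eta = rng.gamma(0.5, 1.0, size=d_in)                  # G(1/2, 1)
lam_hat = 1.0 / rng.gamma(0.5, 1.0, size=d_in)        # IG(1/2, 1)
kappa = rng.gamma(0.5, b_tau ** 2)                    # G(1/2, rate 1/b_tau^2), i.e. scale b_tau^2
tau_hat = 1.0 / rng.gamma(0.5, 1.0)                   # IG(1/2, 1)
beta = rng.standard_normal((d_in, d_out))
print(weights_from_noncentred(eta, lam_hat, kappa, tau_hat, beta, c))
```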
7 Conclusion

We have addressed the challenging problem of constructing a family of distributions that allows for some flexibility in its dependence structure, whilst also having a reasonable computational complexity. It has been shown experimentally that it can constitute a useful replacement of a Gaussian approximation without requiring many algorithmic changes.

Acknowledgements

Alain Durmus acknowledges support from Chaire BayeScale P. Laffitte and from the Polish National Science Center grant NCN UMO-2018/31/B/ST1/0025. This research has been partly financed by the Alan Turing Institute under the EPSRC grant EP/N510129/1. The authors acknowledge the use of the UCL Myriad High Throughput Computing Facility (Myriad@UCL), and associated support services, in the completion of this work.

References

[1] David Barber and Christopher M Bishop. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems, pages 395-401, 1998.
[2] Tim Bedford and Roger M Cooke. Probability density decomposition for conditionally dependent random variables modeled by vines. Annals of Mathematics and Artificial Intelligence, 32(1-4):245-268, 2001.
[3] Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.
[4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of The 32nd International Conference on Machine Learning, pages 1613-1622, 2015.
[6] Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480, 2010.
[7] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891, 2018.
[8] Petros Dellaportas and Mike G Tsionas. Importance sampling from posterior distributions using copula-like approximations. Journal of Econometrics, 2018.
[9] Joshua V Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A Saurous. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.
[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[11] Jeffrey Dissmann, Eike C Brechmann, Claudia Czado, and Dorota Kurowicka. Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis, 59:52-69, 2013.
[12] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.
[13] Kai Wang Fang. Symmetric Multivariate and Related Distributions. Chapman and Hall/CRC, 2017.
[14] Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441-452, 2018.
[15] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[16] Alan Genz. Methods for generating random orthogonal matrices. Monte Carlo and Quasi-Monte Carlo Methods, pages 199-213, 1998.
[17] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881-889, 2015.
[18] Samuel J Gershman, Matthew D Hoffman, and David M Blei. Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning, pages 235-242. Omnipress, 2012.
[19] Soumya Ghosh and Finale Doshi-Velez. Model selection in Bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388, 2017.
[20] Soumya Ghosh, Jiayu Yao, and Finale Doshi-Velez. Structured variational learning of Bayesian neural networks with horseshoe priors. arXiv preprint arXiv:1806.05975, 2018.
[21] Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[22] Fangjian Guo, Xiangyu Wang, Kai Fan, Tamara Broderick, and David B Dunson. Boosting variational inference. arXiv preprint arXiv:1611.05559, 2016.
[23] Shaobo Han, Xuejun Liao, David Dunson, and Lawrence Carin. Variational Gaussian copula inference. In Artificial Intelligence and Statistics, pages 829-838, 2016.
[24] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.
[25] J Hron, De G Matthews, Z Ghahramani, et al. Variational Bayesian dropout: Pitfalls and fixes. In 35th International Conference on Machine Learning, ICML 2018, volume 5, pages 3199-3219, 2018.
[26] Ferenc Huszár. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
[27] John Ingraham and Debora Marks. Variational inference for sparse and undirected models. In International Conference on Machine Learning, pages 1607-1616, 2017.
[28] Tommi Jaakkola and Michael Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, page 4, 1997.
[29] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[30] Mohamad A Khaled and Robert Kohn. On approximating copulas by finite mixtures. arXiv preprint arXiv:1705.10440, 2017.
[31] Mohammad Emtiyaz Khan, Aleksandr Aravkin, Michael Friedlander, and Matthias Seeger. Fast dual variational inference for non-conjugate latent Gaussian models. In International Conference on Machine Learning, pages 951-959, 2013.
[32] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In International Conference on Machine Learning, pages 2616-2625, 2018.
[33] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[34] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743-4751, 2016.
[35] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
[36] Ioannis Kosmidis and Dimitris Karlis. Model-based clustering using copulas with applications. Statistics and Computing, 26(5):1079-1099, 2016.
[37] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
[38] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430-474, 2017.
[39] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference. In Advances in Neural Information Processing Systems, pages 3401-3411, 2018.
[40] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar Rätsch. Boosting variational inference: an optimization perspective. In International Conference on Artificial Intelligence and Statistics, pages 464-472, 2018.
[41] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3290-3300, 2017.
[42] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[43] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pages 2218-2227, 2017.
[44] Michael Mathieu and Yann LeCun. Fast approximation of rotations and Hessians matrices. arXiv preprint arXiv:1404.7195, 2014.
[45] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.
[46] Andrew C Miller, Nicholas J Foti, and Ryan P Adams. Variational boosting: Iteratively refining posterior approximations. In International Conference on Machine Learning, pages 2420-2429, 2017.
[47] Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pages 6246-6256, 2018.
[48] David S Moore and Marcus C Spruill. Unified large-sample theory of general chi-squared statistics for tests of fit. The Annals of Statistics, pages 599-616, 1975.
[49] Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, and Ivan Oseledets. Quadrature-based features for kernel approximation. In Advances in Neural Information Processing Systems, pages 9165-9174, 2018.
[50] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[51] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[52] Sarah E Neville, John T Ormerod, MP Wand, et al. Mean field variational Bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113-1151, 2014.
[53] Victor M-H Ong, David J Nott, and Michael S Smith. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465-478, 2018.
[54] Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, 2009.
[55] Omiros Papaspiliopoulos, Gareth O Roberts, and Martin Skold. Non-centred parameterisations for hierarchical models and data augmentation. Bayesian Statistics, 7, 2003.
[56] Juho Piironen, Aki Vehtari, et al.
Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018-5051, 2017.
[57] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In AISTATS, pages 814-822, 2014.
[58] Rajesh Ranganath, Dustin Tran, and David M Blei. Hierarchical variational models. In International Conference on Machine Learning, 2016.
[59] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pages 1530-1538, 2015.
[60] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278-1286, 2014.
[61] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901-909, 2016.
[62] M Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229-231, 1959.
[63] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971-1979, 2014.
[64] Michalis K Titsias and Francisco Ruiz. Unbiased implicit variational inference. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 167-176, 2019.
[65] Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.
[66] Dustin Tran, David Blei, and Edo M Airoldi. Copula variational inference. In Advances in Neural Information Processing Systems, pages 3564-3572, 2015.
[67] Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.
[68] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[69] Mingzhang Yin and Mingyuan Zhou. Semi-implicit variational inference. In International Conference on Machine Learning, pages 5646-5655, 2018.
[70] Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. In International Conference on Machine Learning, pages 5847-5856, 2018.