# Deep Gaussian Markov Random Fields

Per Sidén, Fredrik Lindsten
Division of Statistics and Machine Learning, Department of Computer and Information Science, Linköping University, Linköping, Sweden

Abstract: Gaussian Markov random fields (GMRFs) are probabilistic graphical models widely used in spatial statistics and related fields to model dependencies over spatial structures. We establish a formal connection between GMRFs and convolutional neural networks (CNNs). Common GMRFs are special cases of a generative model where the inverse mapping from data to latent variables is given by a 1-layer linear CNN. This connection allows us to generalize GMRFs to multi-layer CNN architectures, effectively increasing the order of the corresponding GMRF in a way that has favorable computational scaling. We describe how well-established tools, such as autodiff and variational inference, can be used for simple and efficient inference and learning of the deep GMRF. We demonstrate the flexibility of the proposed model and show that it outperforms the state of the art on a dataset of satellite temperatures, in terms of both prediction and predictive uncertainty.

1. Introduction

Convolutional neural networks (CNNs) are the de facto standard model in computer vision when training on a large set of images. Images are lattice-based data with local dependencies and thus have clear connections with spatial statistics. However, many spatial problems lack the abundance of data common to computer vision applications, and often we need to build a model based on a single image, i.e., data recorded over some spatial field. Models such as deep image prior (Ulyanov et al., 2018) have shown that CNN architectures can encode useful spatial priors even in such situations, but the dominant approach is still to model the spatial dependencies explicitly using, e.g., Gaussian processes (GPs) (Williams & Rasmussen, 2006) or Gaussian
Markov random fields (GMRFs) (Rue & Held, 2005).

(Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s). Correspondence to: Per Sidén.)

In this paper we show a formal connection between GMRFs applied to lattice data and CNNs. Common GMRFs based on nearest-neighbour interactions can be viewed as special cases of a generative model where the inverse mapping from the spatial field x to a latent variable z is given by a 1-layer linear CNN. Since common GP models have previously been shown to be tightly linked with GMRFs (Lindgren et al., 2011), this connection applies to certain GPs as well. Modeling the inverse mapping (x → z) using a CNN results in an auto-regressive (AR) spatial model, whereas using a CNN for the forward mapping (z → x) would correspond to a moving average (MA) model (see, e.g., Ljung, 1999, for a discussion of AR and MA models in a time-series context). This has the important implication that we obtain an infinite receptive field (i.e., infinite range on the spatial dependencies in x) even with a 1-layer model. Indeed, this is a well-known property of GMRFs.

The interpretation of a GMRF as a 1-layer CNN opens up a straightforward generalization to multi-layer architectures, resulting in deep GMRFs (DGMRFs). Even when all layers are linear, this has important practical implications: adding layers corresponds to increasing the auto-regressive order of the GMRF, i.e., edges are added to the GMRF graph, which improves its expressivity. For conventional GMRF algorithms, simply adding more edges can have a significant impact on the computational complexity, due to a reduced degree of sparsity of the resulting precision matrix. For a DGMRF, on the other hand, the structure imposed by a multi-layer CNN architecture results in a favorable computational scaling.
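The AR versus MA distinction can be made concrete in one dimension. The following minimal numpy sketch (an illustration added here, not part of the original paper) uses a banded matrix G in place of a 1-layer convolution: using it in the inverse mapping (AR) yields dependence at all lags, while using it in the forward mapping (MA) cannot reach beyond the filter width.

```python
import numpy as np

# A banded matrix G stands in for a 1-layer (local) convolution in 1D.
N = 60
G = np.eye(N) * 1.5 - np.eye(N, k=1) * 0.7

# AR / inverse mapping z = G x:  Cov(x) = G^{-1} G^{-T} is dense,
# so even a local filter induces dependence at every lag.
G_inv = np.linalg.inv(G)
cov_ar = G_inv @ G_inv.T

# MA / forward mapping x = G z:  Cov(x) = G G^T is banded,
# so dependence is exactly zero beyond the filter width.
cov_ma = G @ G.T

lag = 10
print(cov_ar[0, lag])  # small but non-zero: infinite receptive field
print(cov_ma[0, lag])  # zero: finite receptive field
```

Stacking more banded (MA) layers widens the band only layer by layer, whereas the AR construction reaches all lags already with a single layer; this is the property the DGMRF inherits.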
Indeed, using variational inference for the latent variables, we obtain a learning algorithm that scales linearly both with the number of dimensions in the spatial field (pixels) and with the number of layers (AR order). Furthermore, viewing GMRFs through the lens of deep learning allows us to use well-established toolboxes for, e.g., automatic differentiation and GPU training, to facilitate simple learning of DGMRFs. After a review of GMRFs in Section 2, we introduce the DGMRF model in Section 3. Section 4 describes how to efficiently train the model, and how to compute the posterior predictive distribution, including uncertainty. We discuss related work in Section 5. The results in Section 6 illustrate how DGMRFs are adaptive models with outstanding predictive ability. Section 7 concludes.

2. Background

2.1. Gaussian Markov Random Fields

A Gaussian Markov random field (GMRF) x is an N-dimensional Gaussian vector with mean µ and precision (inverse covariance) matrix Q, so that x ∼ N(µ, Q⁻¹). For each GMRF there exists a graph G = (V, E), with vertices V that correspond to the elements in x, and edges E that define their conditional independencies. For example, the vertices could represent the pixels in an image, with the edges connecting neighboring pixels. Formally,

    {i, j} ∉ E  ⟺  x_i ⊥ x_j | x_{-ij},  for all i ≠ j,

where x_{-ij} refers to all elements except i and j, meaning that two elements x_i and x_j that are not neighbors in the graph are conditionally independent given the rest. Importantly, the edges E also determine the zero-pattern of the precision matrix Q, as every GMRF has the property

    {i, j} ∉ E  ⟺  Q_{ij} = 0,  for all i ≠ j.

This means that a sparsely connected graph G results in a sparse precision matrix Q, which gives great computational gains compared to working with the dense covariance matrix in many large-scale applications.

2.2.
Example: GMRF Defined Using Convolution

As an example, consider the second-order intrinsic GMRF or thin-plate spline model (Rue & Held, 2005), which can be defined by x ∼ N(0, (GᵀG)⁻¹), with

    G_{ij} = 4, for i = j;  G_{ij} = -1, for i ~ j;  G_{ij} = 0, otherwise,    (1)

where i ~ j denotes adjacency.¹ Imputing missing pixel values conditioned on the second-order neighborhood using this prior is equivalent to bicubic interpolation. The non-zero elements of each row i of G and Q = GᵀG, with respect to neighboring pixels in 2D, can be compactly represented through the stencils

    w_G:
          -1
     -1    4   -1
          -1

    w_Q:
                1
           2   -8    2
      1   -8   20   -8    1        (2)
           2   -8    2
                1

¹ It is perhaps more standard to define G_{ii} equal to the number of neighbors of pixel i, which makes a difference at the image border. Our definition is convenient here as it makes G invertible.

An equivalent definition of this model is to first define z ∼ N(0, I) and then define x through the inverse transform

    z = Gx.    (3)

It is trivial that x is Gaussian with mean 0, and it can be readily verified that Cov(x) = G⁻¹ I G⁻ᵀ = (GᵀG)⁻¹. In a third, equivalent definition, the inverse transform z = Gx is instead written as a convolution, Z = conv(X, w_G), where Z and X are image representations of the vectors z and x. The stencil w_G in Eq. (2) is here used as a filter, and conv() denotes same convolution (padding="SAME"), for the equivalence to hold. This observation, that a certain kind of GMRF can be defined with the convolution operator, a filter and an auxiliary variable, is a key observation for DGMRFs, which are defined in Section 3.

2.3. Link Between GMRFs and Gaussian Processes

Another instructive view of GMRFs is as a representation of a Gaussian process (GP) with Matérn kernel (Lindgren et al., 2011). This result comes from a stochastic partial differential equation (SPDE) of the form

    (κ² - Δ)^γ τ x(s) = W(s),

which can be shown to have a GP solution x(s) with Matérn covariance function (Whittle, 1954; 1963).
Here, W(s) is Gaussian white noise in a continuous coordinate s, Δ is the Laplacian operator, and κ, τ and γ are hyperparameters that appear in the Matérn kernel. In particular, γ controls the smoothness of the GP. Moreover, for positive integer values of γ, a numerical finite-difference approximation to the SPDE on the integer lattice is given by

    τ (κ²I + G)^γ x = z,    (4)

with G defined as in Eq. (1). As in Eq. (3), this inverse transform describes a GMRF, here with precision matrix Q = τ² ((κ²I + G)^γ)ᵀ (κ²I + G)^γ. Firstly, this provides a sparse representation of a GP as a GMRF, with a discrepancy that can be reduced by making the discretization of the SPDE finer. Secondly, it gives an interpretation of GMRFs as models with similar properties to GPs.

2.4. GMRFs as Spatial Priors

GMRFs are commonly used as spatial priors for latent variables, both in spatial statistics and in image analysis. In the simplest case, the data y are modeled as Gaussian and conditionally independent given x:

    p(y|x) = ∏_{i ∈ M} p(y_i|x_i),    y_i|x_i ∼ N(y_i | x_i, σ²),

where M ⊆ {1, ..., N} is the set of observed pixels, with M = |M|. Typical problems include inpainting (M < N) and denoising (σ² > 0), where the target is to reconstruct the latent x. Conveniently, in this situation the GMRF prior is conjugate, so the posterior is also a GMRF,

    x|y ∼ N(µ̃, Q̃⁻¹),    Q̃ = Q + (1/σ²) I_m,    µ̃ = Q̃⁻¹ (Qµ + (1/σ²) y).    (5)

Here, the observation mask m has value 0 for missing pixels and 1 elsewhere, y are the observations with value 0 at missing pixels, and I_m is the identity matrix with diagonal element 0 at missing pixels. Although the posterior is available in closed form, the computational cost associated with Q̃⁻¹, needed for the posterior mean and marginal standard deviations, can be high. This is discussed further in Section 4.

3. Deep Gaussian Markov Random Fields

3.1. Linear DGMRFs

We define a linear DGMRF x using an auxiliary standard Gaussian vector z and a bijective function gθ: ℝᴺ → ℝᴺ:

    z ∼ N(0, I),    z = gθ(x).
In other words, we define x through the inverse transform g⁻¹θ. The function gθ(x) is for now assumed to be linear, so we can write gθ(x) = Gθx + bθ, where Gθ is an invertible square matrix. The non-linear case is considered in Section 3.4. We specify gθ(x) using a CNN with L layers. Let Z_l be a tensor of dimension H × W × C, with height H, width W and C channels, and let z_l = vec(Z_l) be its vectorized version, with length N = HWC. The output of layer l is defined as

    Z_l = conv(Z_{l-1}, w_l) + b_l,    (6)

where w_l is a 4D tensor containing C × C 2D filters and b_l is a set of C biases. Here, conv() refers to multichannel same convolution; more details are given in Section 3.3. In particular, we define Z_L ≜ Z and Z_0 ≜ X. An illustration of the model is given in Figure 1. The model parameters θ = {(w_l, b_l) : l = 1, ..., L} will be omitted in the following for brevity. Just as for normalizing flows (Dinh et al., 2014; Rezende & Mohamed, 2015), g can be seen as a composition of bijections g = g_L ∘ g_{L-1} ∘ ··· ∘ g_1, each with a corresponding transform matrix G_l. Since g is linear, x is a GMRF with density

    p(x) = |det(G)| (2π)^{-N/2} exp( -(1/2) (x - µ)ᵀ GᵀG (x - µ) ),    (7)

with G = G_L G_{L-1} ··· G_1 and mean µ = -G⁻¹b, where b can be computed as b = g(0).

Figure 1. Illustration of the deep GMRF. The observed data Y are incomplete measurements of the latent DGMRF X. The prior distribution of X is defined by a CNN that transforms the input X to the output image Z, which has iid standard normal pixel values.

The determinant det(G) can be computed as

    det(G) = ∏_{l=1}^{L} det(G_l),    (8)

and we address the problem of making this computation fast in Section 3.2. This GMRF has precision matrix Q = GᵀG, which is guaranteed to be positive (semi-)definite for all θ; this gives an advantage compared to modeling Q directly. The reason for defining x through an inverse transform z = g(x), rather than a forward transform, is twofold. Firstly, it establishes a formal connection to traditional GMRFs; see Proposition 1 below.
Secondly, it gives an AR model with an infinite receptive field, meaning that the output prediction at each pixel depends on all the input pixels, rather than just on nearby pixels, even for a one-layer model. Compared to dilated CNNs (e.g., Yu & Koltun, 2015), which achieve long (yet finite) range dependencies through several layers of dilated filters, this is a simpler construction.

3.2. Computationally Cheap Convolutional Filters

In this paper, we consider two special forms of convolutional filters: plus (+) filters and sequential (seq) filters, with forms

    w_+:                 w_seq:
      0   a3    0          0    0    0
     a2   a1   a4          0   a1   a2        (9)
      0   a5    0         a3   a4   a5

where a1, ..., a5 ∈ ℝ are parameters to be learned, and the positions shown as 0 are fixed to zero. The benefit of filters with these special designs is that they correspond to transforms for which det(G_l) in Eq. (8) can be cheaply computed. Defining a linear DGMRF through a sequence of convolutions using different small filters is of course equivalent to defining it using a single convolution with a larger filter, apart from at the boundary. The equivalent larger filter can be obtained by sequentially convolving the smaller filters with themselves. The main motivation for using the deep definition is that it admits cheap determinant calculations, using Eq. (8) and the +- and seq-filters, which is not the case for a general larger filter. Also, a deep definition results in fewer parameters to learn. For example, four full 3×3 filters have 36 parameters, compared to 81 parameters for the corresponding 9×9 filter. Finally, the deep convolutional architecture has proven successful in countless applications of neural networks, which makes it plausible that GMRFs should benefit from the same inductive biases. The addition of non-linearities between the layers is discussed in Section 3.4. We now discuss the two filter types and their determinant computation in the single-channel case, and treat the multichannel construction in Section 3.3.

3.2.1.
+-Filters

We begin with two propositions, the first connecting linear DGMRFs with +-filters to the GMRFs in Section 2, and the second giving the cheap determinant computation.

Proposition 1. The second-order intrinsic GMRF, as well as the GMRF approximation of a Matérn GP, are special cases of the linear DGMRF model with +-filters.

Proof. The non-zero pattern of the +-filter is the same as that of w_G in Eq. (2), meaning that a one-layer linear DGMRF with the same filter weights is equivalent to the second-order intrinsic GMRF. Similarly, Eq. (4) can be written as a linear DGMRF with L layers of +-filters, for L ≥ γ. This requires that γ of the layers have the same filter, with a1 = 4 + κ² and a2 = ··· = a5 = -1, and the other L - γ layers to be identity functions.

Proposition 2. The linear transform matrix G_+ defined by single-channel same convolution of an H × W image with the +-filter defined in Eq. (9) has determinant

    det(G_+) = ∏_{i=1}^{H} ∏_{j=1}^{W} ( a1 + 2 √(a3 a5) cos(πi/(H+1)) + 2 √(a2 a4) cos(πj/(W+1)) ),

where √(-1) is treated as imaginary. Computing the determinant thus has complexity O(N).

The proof, given in detail in the supplement, shows that the factors of this product are identical to the eigenvalues of G_+, which can be done by writing G_+ as a Kronecker sum of tridiagonal Toeplitz matrices. Proposition 2 provides a fast method for computing the determinant, which would have complexity O(N³) for a general square matrix, or O(N^{3/2}) if based on sparse Cholesky factorization (Rue & Held, 2005). In practice, we reparameterize a1, ..., a5 to ensure that all the eigenvalues are real and positive; see details in the supplement. Without this constraint, we have observed unstable, oscillating solutions in some situations, and the constraint also ensures that the bijectivity assumption is valid.

3.2.2. Seq-Filters

For some ordering of the image pixels, the output pixels of a convolution with a seq-filter depend only on previous input pixels.
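Both determinant shortcuts can be sanity-checked numerically on a small image: the eigenvalue product of Proposition 2 for the +-filter, and the triangularity just described for the seq-filter. The sketch below is an illustration added here, not the paper's implementation; it assumes row-major (raster) vectorization and builds the transform matrix directly from the neighbour weights (flipping the kernel would merely transpose G, leaving the determinant unchanged), and the numeric seq weights are arbitrary example values.

```python
import numpy as np

def filter_matrix(H, W, offsets):
    """Transform matrix of 'same' convolution on an H x W image,
    row-major vectorization; offsets maps (row, col) shifts to weights."""
    G = np.zeros((H * W, H * W))
    for i in range(H):
        for j in range(W):
            for (di, dj), a in offsets.items():
                ii, jj = i + di, j + dj
                if 0 <= ii < H and 0 <= jj < W:
                    G[i * W + j, ii * W + jj] = a
    return G

H, W = 4, 5
a1, a2, a3, a4, a5 = 4.5, -1.0, -0.8, -1.0, -0.8

# +-filter: center a1, left a2, up a3, right a4, down a5
G_plus = filter_matrix(H, W, {(0, 0): a1, (0, -1): a2, (-1, 0): a3,
                              (0, 1): a4, (1, 0): a5})

# Proposition 2: the determinant is a product over HW closed-form eigenvalues
i = np.arange(1, H + 1)[:, None]
j = np.arange(1, W + 1)[None, :]
eig = (a1 + 2 * np.sqrt(a3 * a5 + 0j) * np.cos(np.pi * i / (H + 1))
          + 2 * np.sqrt(a2 * a4 + 0j) * np.cos(np.pi * j / (W + 1)))
det_plus = np.prod(eig).real

# seq-filter: center a1, remaining weights only on pixels that come later
# in raster order, so G_seq is triangular under some pixel ordering
G_seq = filter_matrix(H, W, {(0, 0): a1, (0, 1): 0.2, (1, -1): -0.5,
                             (1, 0): 0.3, (1, 1): 0.1})

print(det_plus, np.linalg.det(G_plus))      # these agree
print(a1 ** (H * W), np.linalg.det(G_seq))  # these agree: det = a1^N
```

The dense matrices here are only for checking; the whole point of the propositions is that neither determinant requires forming G at all.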
This is equivalent to saying that the corresponding transform matrix G_seq is such that P G_seq Pᵀ is lower triangular for some permutation matrix P, which implies that det(G_seq) = a1ᴺ. Seq-filters are therefore extremely cheap, but somewhat restricted compared to +-filters, due to their inability to model non-sequential patterns. However, the seq-filter in Eq. (9) can be rotated/mirrored in eight different ways, each encoding a different pixel ordering, so different sequential dependencies can be modelled in different layers. Also, seq-filters can trivially be extended to 5×5 or 7×7 filters, still without changing the determinant. Connections to auto-regressive models, for example PixelCNN (Van den Oord et al., 2016b), are discussed in Section 5.

3.3. Multichannel Construction

When C > 1, Eq. (6) can be written in vector form as z_l = G_l z_{l-1} + b_l, where G_l is a C × C block matrix with blocks G_{l,i,j}, i, j = 1, ..., C, and b_l is the corresponding stacked bias vector. Here, G_{l,i,j} is the transition matrix of a single convolution from input channel j to output channel i, and b_{l,j} is the bias of output channel j. In order to make det(G_l) computationally tractable, we set G_{l,i,j} = 0 for i < j, making G_l lower block triangular, so that

    det(G_l) = ∏_{c=1}^{C} det(G_{l,c,c}).

The ordering of the channels can vary between layers to allow information to flow back and forth between all channels. One could also add invertible 1×1 convolutions (Kingma & Dhariwal, 2018) between layers to make this ordering dynamic and learnable. A multichannel DGMRF could learn more interesting representations by storing different features in different channels of the hidden layers. Even with single-channel data, the hidden layers can be multichannel by using a multiscale architecture (Dinh et al., 2017).

3.4. Non-Linear Extension

Linear DGMRFs can be extended by adding non-linear activation functions between the layers of the neural network gθ(x). Formally, Eq.
(6) is replaced by

    Z_l = ψ_l ( conv(Z_{l-1}, w_l) + b_l ),

where ψ_l is a non-linear scalar function that operates element-wise. We restrict ψ_l to be strictly increasing, to ensure that gθ(x) is a bijection. The density of x can now be computed using the change-of-variables rule:

    log p(x) = log p(z) + log |det(dz/dx)|
             = log p(z) + ∑_{l=1}^{L} log |det(G_l)| + ∑_{l=1}^{L} ∑_{i=1}^{N} log |ψ′_l(h_{l,i})|,

where h_l = vec(conv(Z_{l-1}, w_l) + b_l) is the input to ψ_l, and ψ′_l is its derivative. Just as in the linear case, the computational cost of this density is linear in N. By default we take ψ_l to be Parametric Rectified Linear Units (PReLUs), defined by

    ψ_l(h) = h, for h ≥ 0,    ψ_l(h) = α_l h, for h < 0,

where the α_l > 0 are learnable parameters. We can thus add α1, ..., αL to the parameters θ and optimize them jointly.

4. Learning and Inference

There are two kinds of unknown variables that we wish to infer: the latent field x and the model parameters θ. Since we are not interested in the posterior uncertainty of θ directly, we take a practical course of action and optimize θ, using a variational lower bound on the marginal likelihood p(y|θ). Given the optimal value θ̂, for the linear model we make a fully Bayesian treatment of the posterior p(x|θ̂, y).

4.1. Optimization of Parameters

For the linear DGMRF, the marginal likelihood can be written in closed form as

    p(y|θ) = p(y|x, θ) p(x|θ) / p(x|y, θ),

for an arbitrary value of x. Unfortunately, this expression requires the determinant of the posterior precision matrix, det(GᵀG + σ⁻²I_m), which is computationally infeasible to evaluate for large N. Instead, we focus on the variational evidence lower bound (ELBO), L(θ, φ, y) ≤ log p(y|θ), which can be written as

    L(θ, φ, y) = E_{qφ(x)} [ -log qφ(x) + log p(y, x|θ) ],

where qφ(x) is the variational posterior approximation, which depends on variational parameters φ. We here only intend to use qφ(x) as a means for optimizing θ, and not, for example, to make posterior predictions.
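In outline, a reparameterized Monte Carlo estimator of this bound looks as follows. This is a minimal numpy sketch added for illustration, using a deliberately crude independent Gaussian qφ on a tiny, fully observed linear model (the actual implementation uses TensorFlow and autodiff); the estimate lower-bounds the exact log marginal likelihood, which is available in closed form for this linear-Gaussian example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fully observed linear model: x ~ N(0, (G^T G)^{-1}), y_i = x_i + noise.
N, sigma = 5, 0.3
G = np.eye(N) * 1.2 - np.eye(N, k=1) * 0.5   # stands in for the 1-layer CNN
y = rng.normal(size=N)

def log_joint(x):
    z = G @ x                                 # z = g(x), with zero bias
    log_px = (np.linalg.slogdet(G)[1]
              - 0.5 * N * np.log(2 * np.pi) - 0.5 * z @ z)
    r = y - x
    log_py = (-0.5 * N * np.log(2 * np.pi * sigma ** 2)
              - 0.5 * r @ r / sigma ** 2)
    return log_px + log_py

# Deliberately crude independent Gaussian q(x) = N(nu, diag(s^2))
nu, s = np.zeros(N), np.ones(N)

def elbo_estimate(n_samples):
    eps = rng.normal(size=(n_samples, N))     # reparameterization trick:
    xs = nu + s * eps                         # x = nu + S^{1/2} eps
    log_q = (-np.sum(np.log(s)) - 0.5 * N * np.log(2 * np.pi)
             - 0.5 * np.sum(eps ** 2, axis=1))
    log_j = np.array([log_joint(x) for x in xs])
    return np.mean(log_j - log_q)             # unbiased ELBO estimate

elbo = elbo_estimate(5000)

# Exact log marginal likelihood: y ~ N(0, (G^T G)^{-1} + sigma^2 I)
S_y = np.linalg.inv(G.T @ G) + sigma ** 2 * np.eye(N)
log_ml = (-0.5 * N * np.log(2 * np.pi)
          - 0.5 * np.linalg.slogdet(S_y)[1]
          - 0.5 * y @ np.linalg.solve(S_y, y))

print(elbo, "<=", log_ml)  # the estimate lower-bounds the evidence
```

In the real model, the same estimator is differentiated with respect to both θ and φ by the autodiff framework, rather than evaluated for fixed parameters as here.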
For simplicity, we choose qφ(x) = N(x|ν, S) with diagonal covariance matrix S, and φ = {ν, S}. After inserting the variational and model densities and simplifying, we can write the ELBO as

    L(θ, φ, y) = (1/2) log det(S_φ) - M log σ_θ + log |det(G_θ)|
                 - (1/2) E_{qφ(x)} [ gθ(x)ᵀ gθ(x) + σ_θ⁻² (y - x)ᵀ I_m (y - x) ],    (10)

where constant terms have been omitted, and where the parameters have been subscripted with θ or φ to clarify whether they are model or variational parameters. Additionally, we use the reparameterization trick (Kingma & Welling, 2013), replacing the last expectation with an average over N_q standard Gaussian samples ε_1, ..., ε_{N_q}, and setting x_i = ν_φ + S_φ^{1/2} ε_i in the sum. This gives an unbiased estimator of the ELBO with low variance. Moreover, this estimator is differentiable with respect to θ and φ, and can be used for stochastic gradient optimization with autodiff and backprop. Parameter learning is fast, with a time complexity that is O(N) for a fixed number of optimization iterations, since backprop in a CNN is linear, as are the determinant computations described in Section 3.2. This can be compared with O(N³) for standard GP inference and O(N^{3/2}) for standard GMRF inference in 2D problems, based on the Cholesky decomposition. The ELBO can be trivially extended to the non-linear model by replacing log |det(G_θ)| in Eq. (10).

4.2. Exact Inference for the Latent Field

For a linear DGMRF, the conditional posterior p(x|θ̂, y) is also a GMRF; see Eq. (5). Even though computing this density is too costly in general, it is possible to compute the posterior mean and to draw samples from the posterior, which can be used for making predictions. Both require solving linear systems Q̃x = c involving the posterior precision matrix Q̃ = GᵀG + σ⁻²I_m. For this we use the conjugate gradient (CG) method (see, e.g.,
Barrett et al., 1994), which is an iterative method that, rather than computing x = Q̃⁻¹c exactly, iteratively minimizes the relative residual ‖Q̃x - c‖ / ‖c‖ until it falls below some threshold δ, which can be set arbitrarily low. In practice, δ = 10⁻⁷ gives solutions that appear visually identical to the exact solution. CG only requires matrix-vector multiplications, which means that the multiplications with G and Gᵀ can be implemented in a matrix-free fashion using convolutions. Posterior sampling of x can be performed using the method of Papandreou & Yuille (2010), by computing

    x_s = Q̃⁻¹ ( Gᵀ(u_1 - b) + (1/σ²)(y + σ I_m u_2) ),

where u_1 and u_2 are standard Gaussian random vectors. It can be easily verified that x_s is Gaussian, with the same mean and covariance as the posterior.

Algorithm 1: Inference algorithm
  Input: data y, model structure, learning rate, N_q, etc.
  Initialize the model parameters θ = {(w_l, b_l, α_l)_{l=1,...,L}, σ²} and the variational parameters φ = {ν, S}
  Optimize the ELBO (Eq. 10) w.r.t. θ and φ using Adam, to obtain θ̂ and φ̂
  if the model is linear then
    Compute the mean and marginal variances of the posterior p(x|y, θ̂) using CG and simple RBMC
  else
    Approximate the mean and marginal variances of the posterior p(x|y, θ̂) with ν̂ and diag(Ŝ)
  end if
  Compute the predictive means E(y*_i|y, θ̂) = E(x_i|y, θ̂) and variances Var(y*_i|y, θ̂) = Var(x_i|y, θ̂) + σ²

Given a number of posterior samples, the posterior marginal variances Var(x_i|y, θ̂) can be naively approximated using Monte Carlo estimation, but are more efficiently approximated using the simple Rao-Blackwellized Monte Carlo (simple RBMC) method of Sidén et al. (2018). This provides a way to compute the posterior predictive uncertainty as Var(y*_i|y, θ̂) = Var(x_i|y, θ̂) + σ². The time complexity is here mainly determined by the CG method, which can be described as O(N√κ), where κ is the condition number of Q̃ (Shewchuk, 1994).
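The pieces above can be checked on a small dense example (a sketch added here for illustration, with prior mean zero and bias b = 0; the real implementation is matrix-free, applying G and Gᵀ through convolutions): a hand-rolled CG computes the posterior mean, and the Papandreou & Yuille perturbation can be verified to have exactly the posterior covariance Q̃⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small dense stand-in for the posterior system: Qt = G^T G + sigma^{-2} I_m
N, sigma = 8, 0.5
G = np.eye(N) * 1.5 - np.eye(N, k=1) * 0.6
m = np.array([1, 1, 0, 1, 0, 1, 1, 1.0])    # observation mask (0 = missing)
y = rng.normal(size=N) * m                   # observations, zero where missing
Im = np.diag(m)
Qt = G.T @ G + Im / sigma ** 2

def cg(matvec, c, tol=1e-10, maxiter=1000):
    """Standard conjugate gradients, using only matrix-vector products."""
    x = np.zeros_like(c)
    r = c - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(c):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Posterior mean solves Qt mu = y / sigma^2 (prior mean zero, b = 0)
mu = cg(lambda v: Qt @ v, y / sigma ** 2)

# Sampler: x_s = Qt^{-1}(G^T u1 + (y + sigma * Im u2) / sigma^2).
# Propagating the noise terms gives Cov(x_s) = Qt^{-1}(G^T G + Im/sigma^2)Qt^{-1},
# which collapses to the posterior covariance Qt^{-1}.
Qt_inv = np.linalg.inv(Qt)
A1 = Qt_inv @ G.T            # multiplies u1
A2 = Qt_inv @ Im / sigma     # multiplies u2  ((1/sigma^2) * sigma = 1/sigma)
cov_sampler = A1 @ A1.T + A2 @ A2.T
```

In the full model the only difference is scale: matvec is implemented by convolution layers instead of a stored matrix, and the covariance identity is what guarantees that CG-based samples are exact posterior draws.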
It is difficult to say how κ depends on N for a general Q̃, but for two-dimensional, second-order elliptic boundary value problems, for example the SPDE approach with γ = 1, κ is linear in N, so the method is O(N^{3/2}). Even though this is the same as for the sparse Cholesky method, we have observed CG to be a lot faster in practice. Also, the storage requirement for CG is O(N), versus O(N log N) for Cholesky. A drawback of non-linear DGMRFs is that they do not give a simple posterior p(x|θ̂, y). We could use the variational approximation qφ(x) instead, but the proposed independent Gaussian variational family is probably too simple to approximate the true posterior in a satisfying way. We discuss possible solutions to this limitation in Section 7. The inference algorithm is summarized in Algorithm 1.

5. Related Work

GMRFs have a long history in spatial and image analysis (see, e.g., Woods, 1972; Besag, 1974; Rue & Held, 2005). The mentioned SPDE approach has been extended in numerous ways, leading to flexible GMRF models with, for example, non-stationarity (Fuglstad et al., 2015), oscillation (Lindgren et al., 2011) and non-Gaussianity (Bolin, 2014). A nested (deep) SPDE is introduced by Bolin & Lindgren (2011), which results in a model similar to ours, but which lacks the connection to CNNs and some of the computational advantages that we provide. Papandreou & Yuille (2010) use GMRFs for inpainting in natural images, in combination with CG to obtain posterior samples. They do not learn the form of the prior precision matrix, but instead rewrite the model as a product of Gaussian experts, and learn the means and variances of the Gaussian factors. Papandreou & Yuille (2011) assume a heavy-tailed non-Gaussian prior for the latent pixel values, and use a variational GMRF approximation of the posterior. GPs (Stein, 1999; Williams & Rasmussen, 2006) are commonplace for modelling spatially dependent data, not necessarily restricted to observations on a grid.
The GP covariance kernel, which encodes the spatial dependencies, can be made flexible (Wilson & Adams, 2013) and deep (Dunlop et al., 2018; Roininen et al., 2019). However, standard GPs are limited by O(N³) computational complexity (assuming the number of measurements M ∈ O(N)). Inducing-point methods (Quiñonero-Candela & Rasmussen, 2005; Snelson & Ghahramani, 2006) can reduce the complexity to O(P²N + P³), or even O(P³) (Hensman et al., 2013), where P is the number of inducing points. However, for grid data, these methods tend to over-smooth the data (Wilson & Nickisch, 2015) unless P is chosen of the same order of magnitude as N. When the data are on a regular grid and fully observed, Kronecker and Toeplitz methods can be used for fast computations (e.g., Saatçi, 2011; Wilson, 2014). Under certain assumptions about the interactions across the input dimensions, such as additivity and separability, this can reduce the complexity to O(N log N). These methods can also be extended to data off the grid (Wilson & Nickisch, 2015), or to incomplete grids (Stroud et al., 2016).

5.3. Deep Generative Models

Deep generative models are normally trained on large datasets, and can then be used to generate new samples from the data distribution, or to explore learned latent representations. For example, generative adversarial networks (GANs; Goodfellow et al., 2014) have been used for inpainting problems in natural images (see, e.g., Yu et al., 2018; Liu et al., 2018) with impressive results. Our model is mainly targeted at smaller datasets, such as single images, where overly complex models are likely to overfit. Instead, we achieve an infinite receptive field through the inverse definition z = g(x), which requires few parameters. Also, linear DGMRFs can readily provide uncertainty estimates, which is prohibitive for most deep generative models.
Overall, deep generative models are difficult to use for typical spatial problems, but they share ideas with our proposed model, as we discuss below. Flow-based or invertible generative models, e.g., NICE (Dinh et al., 2014), normalizing flows (Rezende & Mohamed, 2015), IAF (Kingma et al., 2016), MAF (Papamakarios et al., 2017), real NVP (Dinh et al., 2017), Glow (Kingma & Dhariwal, 2018) and i-ResNet (Behrmann et al., 2019), model x as a bijective function of a latent variable from a simple base distribution, just as our model does. The likelihood can be optimized directly for parameter learning, but this generally requires all pixels to be observed; that is, these methods are not designed to handle the situation with incomplete or noisy observations that we are considering. One connection to our work is that a linear DGMRF with seq-filters can be seen as a fully linear MAF with masked convolutional MADE layers. Dinh et al. (2014) use a learned model for inpainting with projected gradient ascent, which gives the posterior mode but no uncertainty quantification. Lu & Huang (2019) provide a conditional version of Glow for inpainting, but this requires the missing pixels to be known during training, and the predictive distribution is modeled directly and not through a Bayesian inversion of the model. Auto-regressive models, e.g., PixelRNN (Van den Oord et al., 2016a), PixelCNN (Van den Oord et al., 2016b) and PixelCNN++ (Salimans et al., 2017), model the pixel values sequentially, similarly to our proposed seq-filters, but there are some major differences. The seq-filters use masked convolutions to obtain cheap determinant computations, and the ordering of pixels can change between layers, whereas auto-regressive models have no latent variables and are trained using the same image as input and output. Van den Oord et al. (2016a) consider image completion, but are limited to, e.g., the case where the bottom half of the image is missing, and cannot handle a general mask.
Variational autoencoders (Kingma & Welling, 2013) differ from our model in that they use the deep transform of the latent variables to model the parameters of the distribution of the data, rather than the data directly. This makes the recovery of the latent variables intractable, which is not the case for our model; however, we still use the same variational optimization algorithm for speedups. Deep image prior (DIP; Ulyanov et al., 2018) uses CNNs, trained on single images, for denoising and inpainting, with promising results. However, this is not truly a probabilistic generative model, since the latent variables are not inferred, but fixed to some initial random values. It would thus be difficult to obtain predictive uncertainty from such a model.

6. Experiments

We have implemented DGMRFs in TensorFlow (Abadi et al., 2016), taking advantage of autodiff and GPU computations. We train the parameters, and compute the predictive distribution using the CG algorithm. To avoid boundary effects, the images are extended with a 10-pixel-wide frame of missing values on all sides. Additional details about the implementation can be found in the supplement. Code for our methods and experiments is available at https://bitbucket.org/psiden/deepgmrf.

6.1. Toy Data

We demonstrate the behaviour of our model on an inpainting problem for the two toy datasets in Figure 2, which have size 160 × 120 pixels. The data in the first row are generated as a random sample from the Matérn GMRF in Eq. (4) with γ = 1, τ = 1, and κ² = 8/50², corresponding to a spatial correlation range of 50 pixels. The data in the second row are the same, but with added horizontal and vertical edges, to investigate the model's ability to capture such patterns. Column 2 shows the posterior mean of x when the Matérn model with the same hyperparameters is used for prediction, which gives optimal results for the first row, but over-smooths the edges in the second row.
The corresponding results for different instances of the linear DGMRF are shown in columns 3-6. They all perform well in the simple case without edges, in which the +-filter models contain the true model according to Proposition 1. In the case with edges, the model with depth L = 1, which corresponds to a standard GMRF, is too simple to handle the more complex structure in the data, but for L = 3, all the models give reasonable results that preserve the edges.

Figure 2. Posterior mean for inpainting the 160 × 120 pixel toy data without edges (top) and with edges (bottom). The second column shows inpainting by the same Matérn GMRF model and hyperparameters that were used to generate the data without edges.

Figure 3 displays the learned +-filters of the 3-layer model, and the values of valid pixels of the hidden layers, when the original data, including missing pixels, are used as input. The first two filters learn differentiation in the vertical and horizontal directions, and the third filter is close to an identity function. Most spatial structure is removed in layer 3, which is assumed to be standard normal by the model.

Figure 3. A linear deep GMRF with 3 layers of learned +-filters for the toy data with edges. The filters collaborate to remove spatial structure up to the final layer.

6.2. Satellite Data

We compare our method against some popular methods for large datasets in spatial statistics, by considering the satellite data of daytime land surface temperatures used in the competition by Heaton et al. (2018). The data are on a 500 × 300 grid, with 105,569 non-missing observations as training set. The test set consists of 42,740 observations, selected as the pixels that were missing due to cloud cover on a different date. The dataset and the missing pixels are shown in the supplement, together with the posterior output from our model. The data, and code for some of the methods, can be found at https://github.com/finnlindgren/heatoncomparison.
The participants of the competition were instructed to use exponential correlation if applicable, and to add a constant mean and linear effects of latitude and longitude. For this reason, we extend the measurement equation to include linear trends, so that

$y_i \mid x_i \sim \mathcal{N}\left(y_i \mid x_i + F_i \beta, \sigma^2\right)$,  (11)

where F is a spatial covariate matrix with columns corresponding to (constant, longitude, latitude), and β is a 3-dimensional regression coefficient vector, which can be integrated out jointly with x for the predictions; see details in the supplement. Table 1 compares different instances of our model with the methods in the competition, which are briefly described in the supplement, and in more detail in Heaton et al. (2018). One of the top competitors, SPDE, is essentially the same method as described in Section 2.3. For comparison with another deep method, we have also included results for DIP using the authors' own implementation². For DIP, removal of linear trends and normalization to [0, 1] were done in preprocessing, and these steps were inverted before evaluation. As DIP does not give uncertainty estimates, we also compare with an ensemble of 10 DIP models, using the ensemble mean and standard deviation as predictive distribution. The scores used are mean absolute error (MAE), root-mean-squared error (RMSE), mean continuous ranked probability score (CRPS), mean interval score (INT), and prediction interval coverage (CVG). CRPS and INT are proper scoring rules that also account for the predictive uncertainty (Gneiting & Raftery, 2007). These are the same scores that were used in the competition.

²https://github.com/DmitryUlyanov/deep-image-prior

Table 1. Prediction scores on the satellite data. The scores of the methods in the upper pane come from Table 3 in Heaton et al. (2018). Our models are presented in the lower pane. Lower scores are better, except for CVG, for which 0.95 is optimal. The results for our models are averages computed across five random seeds.
Standard deviations across seeds are shown in parentheses for seq5×5, L=5, and for the other models in the supplement.

Method             MAE    RMSE   CRPS   INT    CVG
FRK                1.96   2.44   1.44   14.08  0.79
Gapfill            1.33   1.86   1.17   34.78  0.36
LatticeKrig        1.22   1.68   0.87   7.55   0.96
LAGP               1.65   2.08   1.17   10.81  0.83
Metakriging        2.08   2.50   1.44   10.77  0.89
MRA                1.33   1.85   0.94   8.00   0.92
NNGP               1.21   1.64   0.85   7.57   0.95
Partition          1.41   1.80   1.02   10.49  0.86
Pred. Proc.        2.15   2.64   1.55   15.51  0.83
SPDE               1.10   1.53   0.83   8.85   0.97
Tapering           1.87   2.45   1.32   10.31  0.93
Peri. Embe.        1.29   1.79   0.91   7.44   0.93
DIP                1.53   2.06   -      -      -
DIP ensemble       1.30   1.67   0.96   11.82  0.72
DGMRF (ours)
seq5×5, L=1        1.06   1.42   0.76   7.21   0.97
seq5×5, L=3        0.95   1.30   0.75   8.29   0.97
seq5×5, L=5        0.93   1.25   0.74   8.14   0.97
                  (.037) (.051) (.012) (.461) (.001)
seq3×3, L=5        1.16   1.57   0.81   6.98   0.97
+, L=5             1.09   1.47   0.78   7.63   0.97
seq5×5, L=5, NL    1.37   1.87   -      -      -

For the DGMRFs, based on the first three scores, the seq-filters of size 5×5 perform better than those of size 3×3 and the +-filters, which may be due to the increased flexibility of larger filters. Moreover, deeper models tend to give better results. For the non-linear (NL) model, CG cannot be used to compute the posterior mean and uncertainty, and the first two scores are instead computed based on the mean of the variational approximation. We note that the NL model performs worse than the linear ones. Our primary explanation for this disappointing result is that the variational posterior is insufficient for approximating the true posterior. An independent distribution naturally has limited ability to approximate a spatially dependent posterior, and empirically we have also seen that the variational approximation for the linear model performs much worse than the true posterior (results not shown). DGMRF outperforms all the methods from the competition on all criteria except CVG, where it is slightly worse than NNGP. In terms of MAE, RMSE, and CRPS the improvement is substantial.
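For reference, both proper scoring rules have closed forms when the predictive distribution is Gaussian, as for the linear DGMRF. The sketch below follows the standard formulas in Gneiting & Raftery (2007); it is illustrative, not the evaluation code used in the competition.

```python
import math

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) forecast for observation y
    (Gneiting & Raftery, 2007). Lower is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def interval_score(y, lower, upper, alpha=0.05):
    """Interval score (INT) of a central (1 - alpha) prediction interval:
    the interval width, plus 2/alpha times any amount by which y falls
    outside the interval."""
    score = upper - lower
    if y < lower:
        score += (2.0 / alpha) * (lower - y)
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)
    return score

# A standard-normal forecast evaluated at its own mean:
print(round(crps_gaussian(0.0, 0.0, 1.0), 4))       # 0.2337
# A 95% interval that covers y pays only its width:
print(round(interval_score(0.0, -1.96, 1.96), 2))   # 3.92
```

CVG is then simply the empirical fraction of test observations falling inside their intervals, which is why 0.95 rather than a smaller value is optimal.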
It is difficult to compare computing times due to different hardware, but our method takes roughly 2.5 h for the seq5×5, L=5 model using a Tesla K40 GPU, of which roughly 95% is spent on parameter learning.

7. Conclusions and Future Work

We have proposed deep GMRFs, which enable us to view (high-order) GMRF models as CNNs. We have focused our attention on lattice-based graphs to clearly show the connection to conventional CNNs; however, the proposed method can likely be generalized to arbitrary graphs via graph convolutions (see, e.g., Xu et al., 2019), and to continuously referenced data, similar to Bolin & Lindgren (2011). We have also primarily considered linear CNNs, resulting in a model that is indeed equivalent to a GMRF. The DGMRF nevertheless has favorable properties compared to conventional algorithms for GMRFs: the CNN architecture enables simple and efficient learning, even though the corresponding graph has high connectivity when multiple CNN layers are used. Empirically, we have found that multi-layer architectures are very useful even in the linear case, enabling the model to capture complex data dependencies such as distinct edges, and resulting in state-of-the-art performance on a spatial benchmark problem. Using a CNN for the inverse mapping when defining the DGMRF results in a spatial AR model. This gives rise to an infinite receptive field, but can also result in instability of the learned prior. We have constrained the filter parameters to yield real eigenvalues, and under this constraint we have not seen any issues with instability. A more principled way of addressing this potential issue is left for future work. The CNN-based interpretation offers a natural extension, namely to introduce non-linearities to enable modeling of more complex distributions for the prior p(x).
A related extension would be to replace p(x) with one of the flow-based or auto-regressive generative models mentioned in Section 5.3, but appropriately restricted to avoid overfitting on a single image. Further exploring these extensions requires future work. In particular, we believe that the independent variational approximation is insufficient to accurately approximate the posterior distribution over latent variables in non-linear DGMRFs. One way to address this limitation is to parameterize the square root $S_\phi^{1/2}$ of the covariance matrix of the variational approximation $q_\phi$ directly as a lower triangular matrix. Another approach is to model $q_\phi$ as a GMRF with the same graph structure as the original model.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

Barrett, R., Berry, M. W., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and Van der Vorst, H. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, volume 43. SIAM, Philadelphia, PA, 1994.

Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 573–582, Long Beach, California, USA, 2019. PMLR. URL http://proceedings.mlr.press/v97/behrmann19a.html.

Besag, J. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192–236, 1974.

Bolin, D. Spatial Matérn fields driven by non-Gaussian noise. Scandinavian Journal of Statistics, 41(3):557–579, 2014.

Bolin, D. and Lindgren, F.
Spatial models generated by nested stochastic partial differential equations, with an application to global ozone mapping. Annals of Applied Statistics, 5(1):523–550, 2011.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2017.

Dunlop, M. M., Girolami, M. A., and Stuart, A. M. How deep are deep Gaussian processes? Journal of Machine Learning Research, 19:1–46, 2018.

Fuglstad, G.-A., Lindgren, F., Simpson, D., and Rue, H. Exploring a new class of non-stationary spatial Gaussian random fields with varying local anisotropy. Statistica Sinica, 25(1):115–133, 2015.

Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Guinness, J. and Fuentes, M. Circulant embedding of approximate covariances for inference from Gaussian data on large lattices. Journal of Computational and Graphical Statistics, 26(1):88–97, 2017.

Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., Lindgren, F., Nychka, D. W., Sun, F., and Zammit-Mangion, A. A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425, 2018. doi: 10.1007/s13253-018-00348-w.

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), 2013.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions.
In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.

Lindgren, F., Rue, H., and Lindström, J. An explicit link between Gaussian fields and Gaussian Markov random fields: The SPDE approach. Journal of the Royal Statistical Society Series B, 73(4):423–498, 2011.

Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., and Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100, 2018.

Ljung, L. System Identification: Theory for the User. System Sciences Series. Prentice Hall, Upper Saddle River, NJ, USA, second edition, 1999.

Lu, Y. and Huang, B. Structured output learning with conditional generative flows. arXiv preprint arXiv:1905.13288, 2019.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.

Papandreou, G. and Yuille, A. Gaussian sampling by local perturbations. In Advances in Neural Information Processing Systems 23, pp. 1858–1866, 2010.

Papandreou, G. and Yuille, A. L. Efficient variational inference in large-scale Bayesian compressed sensing. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1332–1339, 2011.

Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
Roininen, L., Girolami, M., Lasanen, S., and Markkanen, M. Hyperpriors for Matérn fields with applications in Bayesian inversion. Inverse Problems & Imaging, 13:1–29, 2019. URL http://aimsciences.org//article/id/d17bde6b-3e5f-438d-af0a-b712cf433748.

Rue, H. and Held, L. Gaussian Markov Random Fields: Theory and Applications. CRC Press, 2005.

Saatçi, Y. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

Shewchuk, J. R. An introduction to the conjugate gradient method without the agonizing pain. Technical Report CS-94-125, Carnegie Mellon University, 1994.

Sidén, P., Lindgren, F., Bolin, D., and Villani, M. Efficient covariance approximations for large sparse precision matrices. Journal of Computational and Graphical Statistics, 27(4):898–909, 2018.

Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257–1264, 2006.

Stein, M. L. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.

Stroud, J. R., Stein, M. L., and Lysen, S. Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. Journal of Computational and Graphical Statistics, 26(1):108–120, 2016.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.

Van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016a.

Van den Oord, A., Kalchbrenner, N., Vinyals, O., Espeholt, L., Graves, A., and Kavukcuoglu, K. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 2016b.

Whittle, P.
On stationary processes in the plane. Biometrika, 41:434–449, 1954.

Whittle, P. Stochastic processes in several dimensions. Bulletin of the International Statistical Institute, 40(2):974–994, 1963.

Williams, C. K. and Rasmussen, C. E. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Wilson, A. and Adams, R. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pp. 1067–1075, 2013.

Wilson, A. and Nickisch, H. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775–1784, 2015.

Wilson, A. G. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge, 2014.

Woods, J. W. Two-dimensional discrete Markovian fields. IEEE Transactions on Information Theory, 18(2):232–240, 1972.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., and Huang, T. S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514, 2018.