# Deconditional Downscaling with Gaussian Processes

Siu Lun Chau (University of Oxford), Shahine Bouabid (University of Oxford), Dino Sejdinovic (University of Oxford)

**Abstract.** Refining low-resolution (LR) spatial fields with high-resolution (HR) information, often known as statistical downscaling, is challenging as the diversity of spatial datasets often prevents direct matching of observations. Yet, when LR samples are modeled as aggregate conditional means of HR samples with respect to a mediating variable that is globally observed, the recovery of the underlying fine-grained field can be framed as taking an inverse of the conditional expectation, namely a deconditioning problem. In this work, we propose a Bayesian formulation of deconditioning which naturally recovers the initial reproducing kernel Hilbert space formulation of Hsu and Ramos [1]. We extend deconditioning to a downscaling setup and devise an efficient conditional mean embedding estimator for multiresolution data. By treating conditional expectations as inter-domain features of the underlying field, a posterior for the latent field can be established as a solution to the deconditioning problem. Furthermore, we show that this solution can be viewed as a two-staged vector-valued kernel ridge regressor and that it enjoys a minimax optimal convergence rate under mild assumptions. Lastly, we demonstrate its proficiency in a synthetic and a real-world atmospheric field downscaling problem, showing substantial improvements over existing methods.

## 1 Introduction

Spatial observations often operate at limited resolution due to practical constraints. For example, remote sensing atmospheric products [2, 3, 4, 5] provide measurements of atmospheric properties such as cloud top temperature and optical thickness, but only at a low resolution. Devising methods to refine low-resolution (LR) variables for local-scale analysis thus plays a crucial part in our understanding of the anthropogenic impact on climate.

When high-resolution (HR) observations of different covariates are available, details can be instilled into the LR field for refinement. This task is referred to as statistical downscaling or spatial disaggregation, and it models LR observations as the aggregation of an unobserved underlying HR field. For example, multispectral optical satellite imagery [6, 7] typically comes at higher resolution than atmospheric products and can be used to refine the latter.

Statistical downscaling has been studied in various forms, notably with a probabilistic treatment [8, 9, 10, 11, 12, 13], in which Gaussian processes (GPs) [14] are typically used in conjunction with a sparse variational formulation [15] to recover the underlying unobserved HR field. Our approach follows this line of research, where we do not observe data from the underlying HR ground-truth field. Deep neural network (DNN) based approaches [16, 17, 18], on the other hand, study this problem in a different setting, where they often assume that both HR and LR matched observations are available for training; their approaches then follow a standard supervised learning setting, learning a mapping between resolutions.

However, both lines of existing methods require access to bags of HR covariates that are paired with aggregated targets, which in practice might be infeasible.
For example, the multitude of satellites in orbit not only collect snapshots of the atmosphere at different resolutions, but also at different places and times, so that these observations are not jointly observed. To overcome this limitation, we propose a more flexible *mediated* statistical downscaling setup that only requires indirect matching between LR and HR covariates through a mediating field. We assume that this additional field can be easily observed and matched separately with both the HR covariates and the aggregate LR targets. We then use this third-party field to mediate learning and downscale our unmatched data. In our motivating application, climate simulations [19, 20, 21] based on physical science can serve as a mediating field, since they provide comprehensive spatiotemporal coverage of meteorological variables that can be matched to both LR and HR covariates.

Formally, let $\mathbf{x} = \{x^{(1)}, \ldots, x^{(n)}\} \subset \mathcal{X}$ be a general notation for bags of HR covariates, $f : \mathcal{X} \to \mathbb{R}$ the field of interest we wish to recover, and $z$ the LR aggregate observations from the field $f$. We suppose that $\mathbf{x}$ and $z$ are unmatched, but that there exist mediating covariates $y, \tilde{y} \in \mathcal{Y}$ such that $(\mathbf{x}, y)$ are jointly observed, and likewise for $(\tilde{y}, z)$, as illustrated in Figure 1. We assume the aggregation observation model $z = \mathbb{E}_X[f(X) \mid Y = \tilde{y}] + \varepsilon$ with some noise $\varepsilon$. Our goal in mediated statistical downscaling is then to estimate $f$ given $(\mathbf{x}, y)$ and $(\tilde{y}, z)$, which corresponds to the deconditioning problem introduced in [1].

*Figure 1: The LR response $z$ (blue) and the bagged HR covariates $\mathbf{x}$ (green) are unmatched. The downscaling is mediated through bag-level LR covariates $y$ and $\tilde{y}$ (orange).*

Motivated by applications in likelihood-free inference and task-transfer regression, Hsu and Ramos [1] first studied the deconditioning problem through the lens of reproducing kernel Hilbert spaces (RKHS) and introduced the framework of Deconditional Mean Embeddings (DME) as its solution. In this work, we first propose a Bayesian formulation of deconditioning that provides a much simpler and more elegant route to the DME-based estimator of Hsu and Ramos [1], using the conditional expectations of $f$. Motivated by our mediated statistical downscaling problem, we then extend deconditioning to a multi-resolution setup and bridge the gap between DMEs and existing probabilistic statistical downscaling methods [9]. By placing a GP prior on the sought field $f$, we obtain a posterior distribution over the downscaled field as a principled Bayesian solution to the downscaling task on indirectly matched data. For scalability, we provide a tractable variational inference approximation and an alternative approximation to the conditional mean operator (CMO) [22] to speed up computation on large multi-resolution datasets.

From a theoretical standpoint, we further develop the framework of DMEs by establishing it as a two-staged vector-valued regressor with a natural reconstruction loss that mirrors the work of Grünewälder et al. [23] on conditional mean embeddings. This perspective allows us to leverage distribution regression theory from [24, 25] and obtain convergence rate results for the deconditional mean operator (DMO) estimator. Under mild assumptions, we obtain conditions under which this rate is minimax optimal in terms of statistical-computational efficiency. Our contributions are summarized as follows:

- We propose a Bayesian formulation of the deconditioning problem of Hsu and Ramos [1].
- We establish its posterior mean as a DME-based estimate and its posterior covariance as a gauge of the deconditioning quality.
- We extend deconditioning to a multi-resolution setup in the context of the mediated statistical downscaling problem and bridge the gap with existing probabilistic statistical downscaling methods. Computationally efficient algorithms are devised.
- We demonstrate that the DMO estimate minimises a two-staged vector-valued regression and derive its convergence rate under mild assumptions, with conditions for minimax optimality.
- We benchmark our model against existing methods for statistical downscaling tasks in climate science, on both synthetic and real-world multi-resolution atmospheric field data, and show improved performance.

## 2 Background Materials

### 2.1 Notations

Let $X, Y$ be a pair of random variables taking values in non-empty sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be positive definite kernels. The closure of the span of their canonical feature maps $k_x := k(x, \cdot)$ and $\ell_y := \ell(y, \cdot)$, for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ respectively, induces the RKHSs $\mathcal{H}_k \subset \mathbb{R}^{\mathcal{X}}$ and $\mathcal{H}_\ell \subset \mathbb{R}^{\mathcal{Y}}$, endowed with inner products $\langle \cdot, \cdot \rangle_k$ and $\langle \cdot, \cdot \rangle_\ell$.

We observe realizations $\mathbf{D}_1 = \{\mathbf{x}_j, y_j\}_{j=1}^N$ of bags $\mathbf{x}_j = \{x_j^{(i)}\}_{i=1}^{n_j}$ drawn from the conditional distribution $\mathbb{P}_{X|Y=y_j}$, with bag-level covariates $y_j$ sampled from $\mathbb{P}_Y$. We concatenate them into vectors $\mathbf{x} := [\mathbf{x}_1 \,\ldots\, \mathbf{x}_N]^\top$ and $\mathbf{y} := [y_1 \,\ldots\, y_N]^\top$. For simplicity, our exposition uses the notation without bagging, i.e. $\mathcal{D}_1 = \{x_j, y_j\}_{j=1}^N$, whenever our contributions pertain to the general theory of deconditioning from Hsu and Ramos [1], namely in Sections 2, 3 and 4. We return to the bagged-dataset formalism in Sections 3.3 and 5, which corresponds to our motivating application of mediated statistical downscaling.

With an abuse of notation, we define feature matrices by stacking feature maps along columns as $\Phi_{\mathbf{x}} := [k_{x_1} \ldots k_{x_N}]$ and $\Psi_{\mathbf{y}} := [\ell_{y_1} \ldots \ell_{y_N}]$, and we denote Gram matrices as $\mathbf{K}_{\mathbf{xx}} = \Phi_{\mathbf{x}}^\top \Phi_{\mathbf{x}} = [k(x_i, x_j)]_{1 \le i,j \le N}$ and $\mathbf{L}_{\mathbf{yy}} = \Psi_{\mathbf{y}}^\top \Psi_{\mathbf{y}} = [\ell(y_i, y_j)]_{1 \le i,j \le N}$. The notation abuse $(\cdot)^\top(\cdot)$ is a shorthand for element-wise RKHS inner products when it is clear from the context.

Let $Z$ denote the real-valued random variable stemming from the noisy conditional expectation of some unknown latent function $f : \mathcal{X} \to \mathbb{R}$, as $Z = \mathbb{E}_X[f(X)|Y] + \varepsilon$. We suppose one observes another set of realizations $\mathcal{D}_2 = \{\tilde{y}_j, \tilde{z}_j\}_{j=1}^M$ from $\mathbb{P}_{YZ}$, sampled independently from $\mathcal{D}_1$. Likewise, we stack observations into vectors $\tilde{\mathbf{y}} := [\tilde{y}_1 \ldots \tilde{y}_M]^\top$, $\tilde{\mathbf{z}} := [\tilde{z}_1 \ldots \tilde{z}_M]^\top$ and define the feature matrix $\Psi_{\tilde{\mathbf{y}}} := [\ell_{\tilde{y}_1} \ldots \ell_{\tilde{y}_M}]$.

### 2.2 Conditional and Deconditional Kernel Mean Embeddings

**Marginal and joint mean embeddings.** Kernel mean embeddings of distributions provide a powerful framework for representing and manipulating distributions without specifying their parametric form [22, 26]. The marginal mean embedding of the measure $\mathbb{P}_X$ is defined as $\mu_X := \mathbb{E}_X[k_X] \in \mathcal{H}_k$ and corresponds to the Riesz representer of the expectation functional $f \mapsto \mathbb{E}_X[f(X)]$. It can hence be used to evaluate expectations, $\mathbb{E}_X[f(X)] = \langle f, \mu_X \rangle_k$. If the mapping $\mathbb{P}_X \mapsto \mu_X$ is injective, the kernel $k$ is said to be characteristic [27], a property satisfied by the Gaussian and Matérn kernels on $\mathbb{R}^d$ [28]. In practice, the Monte Carlo estimator $\hat{\mu}_X := \frac{1}{N}\sum_{i=1}^N k_{x_i}$ provides an unbiased estimate of $\mu_X$ [29]. Extending this rationale to embeddings of joint distributions, we define $C_{YY} := \mathbb{E}_Y[\ell_Y \otimes \ell_Y] \in \mathcal{H}_\ell \otimes \mathcal{H}_\ell$ and $C_{XY} := \mathbb{E}_{X,Y}[k_X \otimes \ell_Y] \in \mathcal{H}_k \otimes \mathcal{H}_\ell$, which can be identified with the cross-covariance operators between Hilbert spaces $C_{YY} : \mathcal{H}_\ell \to \mathcal{H}_\ell$ and $C_{XY} : \mathcal{H}_\ell \to \mathcal{H}_k$.
They correspond to the Riesz representers of the bilinear forms $(g, g') \mapsto \mathrm{Cov}(g(Y), g'(Y)) = \langle g, C_{YY} g' \rangle_\ell$ and $(f, g) \mapsto \mathrm{Cov}(f(X), g(Y)) = \langle f, C_{XY} g \rangle_k$. As above, empirical estimates are obtained as $\hat{C}_{YY} = \frac{1}{N}\Psi_{\mathbf{y}}\Psi_{\mathbf{y}}^\top = \frac{1}{N}\sum_{i=1}^N \ell_{y_i} \otimes \ell_{y_i}$ and $\hat{C}_{XY} = \frac{1}{N}\Phi_{\mathbf{x}}\Psi_{\mathbf{y}}^\top = \frac{1}{N}\sum_{i=1}^N k_{x_i} \otimes \ell_{y_i}$. Again, the notation abuse $(\cdot)(\cdot)^\top$ is a shorthand for element-wise tensor products when clear from context.

**Conditional mean embeddings.** Similarly, one can introduce RKHS embeddings for conditional distributions, referred to as Conditional Mean Embeddings (CME). The CME of the conditional probability measure $\mathbb{P}_{X|Y=y}$ is defined as $\mu_{X|Y=y} := \mathbb{E}_X[k_X \mid Y=y] \in \mathcal{H}_k$. As introduced by Fukumizu et al. [27], it is common to formulate conditioning in terms of a Hilbert space operator $C_{X|Y} : \mathcal{H}_\ell \to \mathcal{H}_k$ called the Conditional Mean Operator (CMO). By definition, $C_{X|Y}$ satisfies $C_{X|Y}\,\ell_y = \mu_{X|Y=y}$ and $C_{X|Y}^*\,f = \mathbb{E}_X[f(X)|Y=\cdot]$ for all $f \in \mathcal{H}_k$, where $C_{X|Y}^*$ denotes the adjoint of $C_{X|Y}$. Moreover, the CMO admits the expression $C_{X|Y} = C_{XY}C_{YY}^{-1}$, provided $\ell_y \in \mathrm{Range}(C_{YY})$ for all $y \in \mathcal{Y}$ [26, 27]. Song et al. [30] show that a nonparametric empirical form of the CMO is

$$\hat{C}_{X|Y} = \Phi_{\mathbf{x}}(\mathbf{L}_{\mathbf{yy}} + N\lambda \mathbf{I}_N)^{-1}\Psi_{\mathbf{y}}^\top, \tag{1}$$

where $\lambda > 0$ is a regularisation parameter ensuring that the empirical operator is globally defined and bounded. As observed by Grünewälder et al. [23], since $C_{X|Y}$ defines a mapping from $\mathcal{H}_\ell$ to $\mathcal{H}_k$, it can be interpreted as the solution to a vector-valued regression problem. This perspective enables the derivation of probabilistic convergence bounds on the empirical CMO estimator [23, 25].

**Deconditional mean embeddings.** Introduced by Hsu and Ramos [1] as a new class of embeddings, Deconditional Mean Embeddings (DME) are the natural counterparts of CMEs. While the CME $\mu_{X|Y=y} \in \mathcal{H}_k$ allows one to take the conditional expectation of any $f \in \mathcal{H}_k$ through the inner product $\mathbb{E}_X[f(X)|Y=y] = \langle f, \mu_{X|Y=y} \rangle_k$, the DME, denoted $\mu_{X=x|Y} \in \mathcal{H}_\ell$ (the slightly unusual notation is taken from Hsu and Ramos [1] and is meant to contrast with the usual conditioning $\mu_{X|Y=y}$), solves the inverse problem and allows one to recover the initial function of which the conditional expectation was taken, through the inner product $\langle \mathbb{E}_X[f(X)|Y=\cdot], \mu_{X=x|Y} \rangle_\ell = f(x)$. The associated Hilbert space operator, the Deconditional Mean Operator (DMO), is thus defined as the operator $D_{X|Y} : \mathcal{H}_k \to \mathcal{H}_\ell$ such that $D_{X|Y}^*\,\mathbb{E}_X[f(X)|Y=\cdot] = f$ for all $f \in \mathcal{H}_k$. It admits an expression in terms of the CMO and cross-covariance operators, $D_{X|Y} = (C_{X|Y}C_{YY})^*\,(C_{X|Y}C_{YY}C_{X|Y}^*)^{-1}$, provided $\ell_y \in \mathrm{Range}(C_{YY})$ and $k_x \in \mathrm{Range}(C_{X|Y}C_{YY}C_{X|Y}^*)$ for all $y \in \mathcal{Y}$ and $x \in \mathcal{X}$. A nonparametric empirical estimate of the DMO using the datasets $\mathcal{D}_1$ and $\mathcal{D}_2$ described above is given by

$$\hat{D}_{X|Y} = \Psi_{\tilde{\mathbf{y}}}\big(\mathbf{A}^\top \mathbf{K}_{\mathbf{xx}} \mathbf{A} + M\epsilon \mathbf{I}_M\big)^{-1}\mathbf{A}^\top\Phi_{\mathbf{x}}^\top,$$

where $\epsilon > 0$ is a regularisation term and $\mathbf{A} := (\mathbf{L}_{\mathbf{yy}} + N\lambda \mathbf{I})^{-1}\mathbf{L}_{\mathbf{y}\tilde{\mathbf{y}}}$ can be interpreted as a mediation operator. Applying the DMO to the expected responses $\tilde{\mathbf{z}}$, Hsu and Ramos [1] recover an estimate of $f$ as

$$\hat{f}(x) = k(x, \mathbf{x})\,\mathbf{A}\big(\mathbf{A}^\top \mathbf{K}_{\mathbf{xx}}\mathbf{A} + M\epsilon \mathbf{I}_M\big)^{-1}\tilde{\mathbf{z}}. \tag{2}$$

Note that since separate samples $\tilde{\mathbf{y}}$ can be used to estimate $C_{YY}$, this naturally fits a mediating-variables setup where $\mathbf{x}$ and the conditional means $\tilde{\mathbf{z}}$ are not jointly observed.
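To make the estimators above concrete, the following is a minimal NumPy sketch of the empirical DME-based recovery (2), with the mediation matrix $\mathbf{A}$ built from the CMO estimator (1). The synthetic data-generating process, kernels and regularisation values are illustrative assumptions only, not the paper's experimental setup.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2)), computed row-wise."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
N, M, lam, eps = 200, 100, 1e-3, 1e-3

# D1 = {(x_i, y_i)}: joint samples linking X to the mediating variable Y
y = rng.uniform(-3, 3, size=(N, 1))
x = y + 0.1 * rng.normal(size=(N, 1))
# D2 = {(y_tilde_j, z_tilde_j)}: noisy conditional means of f(X) = sin(X) given Y
y_tilde = rng.uniform(-3, 3, size=(M, 1))
z_tilde = np.sin(y_tilde).ravel() + 0.05 * rng.normal(size=M)

K_xx = gaussian_kernel(x, x)
L_yy = gaussian_kernel(y, y)
L_yyt = gaussian_kernel(y, y_tilde)

# Mediation matrix A = (L_yy + N*lam*I)^{-1} L_yyt links D1 to D2 through Y
A = np.linalg.solve(L_yy + N * lam * np.eye(N), L_yyt)

# DME-based estimate (2): f_hat(x*) = k(x*, x) A (A^T K_xx A + M*eps*I)^{-1} z_tilde
alpha = A @ np.linalg.solve(A.T @ K_xx @ A + M * eps * np.eye(M), z_tilde)
x_grid = np.linspace(-3, 3, 7).reshape(-1, 1)
f_hat = gaussian_kernel(x_grid, x) @ alpha      # should roughly track sin(x_grid)
```

Note that only kernel evaluations and two regularised linear solves are needed, and that the two datasets enter the estimate solely through the mediating covariates.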
## 3 Deconditioning with Gaussian processes

In this section, we introduce the Conditional Mean Process (CMP), a stochastic process stemming from the conditional expectation of a GP. We provide a characterisation of the CMP and show that the corresponding posterior mean of its integrand is a DME-based estimator. We also derive in Appendix B a variational formulation of our model that scales to large datasets and demonstrate its performance in Section 5. For simplicity, we put aside observation bagging in Sections 3.1 and 3.2, as these contributions are relevant to the general theory of DMEs. We return to the bagged formalism in Section 3.3 and extend deconditioning to the multiresolution setup inherent to the mediated statistical downscaling application. In what follows, $\mathcal{X}$ is a measurable space, $\mathcal{Y}$ a Borel space, and the feature maps $k_x$ and $\ell_y$ are Borel-measurable functions for any $x \in \mathcal{X}$, $y \in \mathcal{Y}$. All proofs and derivations of the paper are included in the appendix.

### 3.1 Conditional Mean Process

Bayesian quadrature [14, 31, 32] is based on the observation that the integral of a GP with respect to some marginal measure is a Gaussian random variable. We start by probing the implications of integrating with respect to the conditional distribution $\mathbb{P}_{X|Y=y}$ and considering such integrals as functions of the conditioning variable $y$. This gives rise to the notion of conditional mean processes.

**Definition 3.1 (Conditional Mean Process).** Let $f \sim \mathcal{GP}(m, k)$ with integrable sample paths, i.e. $\int_{\mathcal{X}} |f| \,\mathrm{d}\mathbb{P}_X < \infty$ a.s. The CMP induced by $f$ with respect to $\mathbb{P}_{X|Y}$ is defined as the stochastic process $\{g(y) : y \in \mathcal{Y}\}$ given by

$$g(y) = \mathbb{E}_X[f(X)|Y=y] = \int_{\mathcal{X}} f(x)\,\mathrm{d}\mathbb{P}_{X|Y=y}(x). \tag{3}$$

By linearity of the integral, it is clear that $g(y)$ is a Gaussian random variable for each $y \in \mathcal{Y}$. The sample path integrability requirement ensures that $g$ is well-defined almost everywhere. The following result characterizes the CMP as a GP on $\mathcal{Y}$.

**Proposition 3.2 (CMP characterization).** Suppose $\mathbb{E}_X[|m(X)|] < \infty$ and $\mathbb{E}_X[\|k_X\|_k] < \infty$, and let $(X', Y') \sim \mathbb{P}_{XY}$. Then $g$ is a Gaussian process $g \sim \mathcal{GP}(\nu, q)$ a.s., specified by

$$\nu(y) = \mathbb{E}_X[m(X)|Y=y], \qquad q(y, y') = \mathbb{E}_{X,X'}[k(X, X')|Y=y, Y'=y'] \tag{4}$$

for all $y, y' \in \mathcal{Y}$. Furthermore, $q(y, y') = \langle \mu_{X|Y=y}, \mu_{X|Y=y'} \rangle_k$ a.s.

Intuitively, the CMP can be understood as a GP on the conditional means, whose covariance $q(y, y')$ is induced by the similarity between the CMEs at $y$ and $y'$. Resorting to the kernel $\ell$ defined on $\mathcal{Y}$, we can re-express the covariance using Hilbert space operators as $q(y, y') = \langle C_{X|Y}\ell_y, C_{X|Y}\ell_{y'} \rangle_k$. A natural nonparametric estimate of the CMP covariance thus follows from the CMO estimator (1) as $\hat{q}(y, y') = \ell(y, \mathbf{y})^\top(\mathbf{L}_{\mathbf{yy}} + N\lambda \mathbf{I}_N)^{-1}\mathbf{K}_{\mathbf{xx}}(\mathbf{L}_{\mathbf{yy}} + N\lambda \mathbf{I}_N)^{-1}\ell(\mathbf{y}, y')$. When $m \in \mathcal{H}_k$, the mean function can be written as $\nu(y) = \langle \mu_{X|Y=y}, m \rangle_k$, for which we can also use the empirical estimate $\hat{\nu}(y) = \ell(y, \mathbf{y})^\top(\mathbf{L}_{\mathbf{yy}} + N\lambda \mathbf{I}_N)^{-1}\Phi_{\mathbf{x}}^\top m$. Finally, one can also quantify the covariance between the CMP $g$ and its integrand $f$, i.e. $\mathrm{Cov}(f(x), g(y)) = \mathbb{E}_X[k(x, X)|Y=y]$. Under the same assumptions as Proposition 3.2, this covariance can be expressed using mean embeddings as $\mathrm{Cov}(f(x), g(y)) = \langle k_x, \mu_{X|Y=y} \rangle_k$ and admits the empirical estimate $k(x, \mathbf{x})(\mathbf{L}_{\mathbf{yy}} + N\lambda \mathbf{I}_N)^{-1}\ell(\mathbf{y}, y)$.
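As a complement, here is a small NumPy sketch of the empirical CMP quantities above: the covariance estimate $\hat{q}$ and the cross-covariance between $f(x)$ and $g(y)$. The data, kernel and regularisation value are again illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(1)
N, lam = 150, 1e-3
y = rng.uniform(-3, 3, size=(N, 1))
x = y + 0.1 * rng.normal(size=(N, 1))           # (x_i, y_i) drawn jointly from P_XY

K_xx = gaussian_kernel(x, x)
L_yy = gaussian_kernel(y, y)
W = np.linalg.inv(L_yy + N * lam * np.eye(N))   # shared (L_yy + N*lam*I)^{-1} factor

def q_hat(y1, y2):
    """Empirical CMP covariance q_hat(y, y') built from the CMO estimator (1)."""
    return gaussian_kernel(y1, y) @ W @ K_xx @ W @ gaussian_kernel(y, y2)

def cov_f_g_hat(x1, y2):
    """Empirical estimate of Cov(f(x), g(y)) = <k_x, mu_{X|Y=y}>_k."""
    return gaussian_kernel(x1, x) @ W @ gaussian_kernel(y, y2)

y_grid = np.linspace(-3, 3, 5).reshape(-1, 1)
Q = q_hat(y_grid, y_grid)                       # CMP covariance matrix on a grid
```

These two quantities are precisely the blocks reused in the next subsection to condition the GP field on aggregate observations.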
### 3.2 Deconditional Posterior

Given the independent observations introduced above, $\mathcal{D}_1 = \{\mathbf{x}, \mathbf{y}\}$ and $\mathcal{D}_2 = \{\tilde{\mathbf{y}}, \tilde{\mathbf{z}}\}$, we may now consider an additive noise model with a CMP prior on the aggregate observations, $\tilde{\mathbf{z}}|\tilde{\mathbf{y}} \sim \mathcal{N}(g(\tilde{\mathbf{y}}), \sigma^2\mathbf{I}_M)$. Let $\mathbf{Q}_{\tilde{\mathbf{y}}\tilde{\mathbf{y}}} := q(\tilde{\mathbf{y}}, \tilde{\mathbf{y}})$ be the kernel matrix induced by $q$ on $\tilde{\mathbf{y}}$ and let $\Upsilon := \mathrm{Cov}(f(\mathbf{x}), \tilde{\mathbf{z}}) = \Phi_{\mathbf{x}}^\top C_{X|Y}\Psi_{\tilde{\mathbf{y}}}$ be the cross-covariance between $f(\mathbf{x})$ and $\tilde{\mathbf{z}}$. The joint normality of $f(\mathbf{x})$ and $\tilde{\mathbf{z}}$ gives

$$\begin{bmatrix} f(\mathbf{x}) \\ \tilde{\mathbf{z}} \end{bmatrix} \,\Big|\, \mathbf{y}, \tilde{\mathbf{y}} \;\sim\; \mathcal{N}\!\left( \begin{bmatrix} m(\mathbf{x}) \\ \nu(\tilde{\mathbf{y}}) \end{bmatrix}, \begin{bmatrix} \mathbf{K}_{\mathbf{xx}} & \Upsilon \\ \Upsilon^\top & \mathbf{Q}_{\tilde{\mathbf{y}}\tilde{\mathbf{y}}} + \sigma^2\mathbf{I}_M \end{bmatrix} \right).$$

Using Gaussian conditioning, we can then readily derive the posterior distribution of the underlying GP field $f$ given the aggregate observations $\tilde{\mathbf{z}}$ corresponding to $\tilde{\mathbf{y}}$. The latter naturally degenerates if observations are paired, i.e. $\mathbf{y} = \tilde{\mathbf{y}}$. This formulation can be seen as an instance of inter-domain GPs [33], where we utilise the observed conditional means $\tilde{\mathbf{z}}$ as inter-domain inducing features for inference on $f$.

**Proposition 3.3 (Deconditional Posterior).** Given aggregate observations $\tilde{\mathbf{z}}$ with homoscedastic noise $\sigma^2$, the deconditional posterior of $f$ is defined as the Gaussian process $f|\tilde{\mathbf{z}} \sim \mathcal{GP}(m_d, k_d)$, where

$$m_d(x) = m(x) + k_x^\top C_{X|Y}\Psi_{\tilde{\mathbf{y}}}\big(\mathbf{Q}_{\tilde{\mathbf{y}}\tilde{\mathbf{y}}} + \sigma^2\mathbf{I}_M\big)^{-1}\big(\tilde{\mathbf{z}} - \nu(\tilde{\mathbf{y}})\big), \tag{6}$$

$$k_d(x, x') = k(x, x') - k_x^\top C_{X|Y}\Psi_{\tilde{\mathbf{y}}}\big(\mathbf{Q}_{\tilde{\mathbf{y}}\tilde{\mathbf{y}}} + \sigma^2\mathbf{I}_M\big)^{-1}\Psi_{\tilde{\mathbf{y}}}^\top C_{X|Y}^*\, k_{x'}. \tag{7}$$

Substituting terms by their empirical forms, we can define a nonparametric estimate of $m_d$ as

$$\hat{m}_d(x) := m(x) + k(x, \mathbf{x})\,\mathbf{A}\big(\hat{\mathbf{Q}}_{\tilde{\mathbf{y}}\tilde{\mathbf{y}}} + \sigma^2\mathbf{I}_M\big)^{-1}\big(\tilde{\mathbf{z}} - \hat{\nu}(\tilde{\mathbf{y}})\big), \tag{8}$$

which, when $m = 0$, reduces to the DME-based estimator in (2), with the noise variance $\sigma^2$ playing the role of the scaled regularization parameter $M\epsilon$. Hsu and Ramos [1] recover a similar posterior mean expression in their Bayesian interpretation of DMEs. However, they do not link the distributions of $f$ and its CMP, which leads to considerably more involved chained inference derivations combining fully Bayesian and MAP estimates, whereas we recover it naturally using simple Gaussian conditioning. Likewise, an empirical estimate of the deconditional covariance is given by

$$\hat{k}_d(x, x') := k(x, x') - k(x, \mathbf{x})\,\mathbf{A}\big(\hat{\mathbf{Q}}_{\tilde{\mathbf{y}}\tilde{\mathbf{y}}} + \sigma^2\mathbf{I}_M\big)^{-1}\mathbf{A}^\top k(\mathbf{x}, x'). \tag{9}$$

Interestingly, the latter can be rearranged as the difference between the original kernel and the kernel undergoing conditioning and deconditioning steps, $\hat{k}_d(x, x') = k(x, x') - \langle k_x, \hat{D}_{X|Y}^*\hat{C}_{X|Y}^*\, k_{x'} \rangle_k$. This can be interpreted as a measure of reconstruction quality, which degenerates in the case of perfect deconditioning, i.e. $\hat{D}_{X|Y}^*\hat{C}_{X|Y}^* = \mathrm{Id}_{\mathcal{H}_k}$.

### 3.3 Deconditioning and multiresolution data

Downscaling applications typically involve multiresolution data, with a bagged dataset $\mathbf{D}_1 = \{(\mathbf{x}_j, y_j)\}_{j=1}^N$ where $\mathbf{x}_j = \{x_j^{(i)}\}_{i=1}^{n_j}$. In this setup, the mismatch in size between the vector concatenations $\mathbf{x} = [x_1^{(1)} \ldots x_N^{(n_N)}]^\top$ and $\mathbf{y} = [y_1 \ldots y_N]^\top$ prevents us from readily applying (1) to estimate the CMO and thus infer the deconditional posterior. There is, however, a straightforward way to alleviate this: simply replicate the bag-level covariates $y_j$ to match the bag sizes $n_j$. Although simple, this method incurs an $\mathcal{O}\big((\sum_{j=1}^N n_j)^3\big)$ cost due to the matrix inversion in (1).

Alternatively, since the bags $\mathbf{x}_j$ are sampled from the conditional distribution $\mathbb{P}_{X|Y=y_j}$, unbiased Monte Carlo estimators of the CMEs are given by $\hat{\mu}_{X|Y=y_j} = \frac{1}{n_j}\sum_{i=1}^{n_j} k_{x_j^{(i)}}$. Let $\hat{\mathbf{M}}_{\mathbf{y}} = [\hat{\mu}_{X|Y=y_1} \ldots \hat{\mu}_{X|Y=y_N}]$ denote their concatenation along columns. We can then rewrite the cross-covariance operator as $C_{XY} = \mathbb{E}_Y[\mathbb{E}_X[k_X|Y] \otimes \ell_Y]$ and hence take $\frac{1}{N}\hat{\mathbf{M}}_{\mathbf{y}}\Psi_{\mathbf{y}}^\top$ as an estimate of $C_{XY}$. Substituting empirical forms into $C_{X|Y} = C_{XY}C_{YY}^{-1}$ and applying the Woodbury identity, we obtain an alternative CMO estimator that only requires the inversion of an $N \times N$ matrix. We call it the Conditional Mean Shrinkage Operator and define it as

$$\hat{C}^{\mathrm{S}}_{X|Y} := \hat{\mathbf{M}}_{\mathbf{y}}(\mathbf{L}_{\mathbf{yy}} + N\lambda\mathbf{I}_N)^{-1}\Psi_{\mathbf{y}}^\top. \tag{10}$$

This estimator can be seen as a generalisation of the Kernel Mean Shrinkage Estimator [34] to the conditional case. We provide in Appendix C the modifications of (8) and (9) that incorporate this estimator for the computation of the deconditional posterior.
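The following sketch illustrates how the shrinkage estimator (10) is evaluated on bagged data: each bag's CME is approximated by a Monte Carlo average of its feature maps, so only an $N \times N$ system has to be solved, rather than one of size $\sum_j n_j$. Bag sizes, kernel and data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(2)
N, lam = 30, 1e-3                                      # number of bags, regularisation
y = rng.uniform(-3, 3, size=(N, 1))                    # bag-level covariates y_j
bags = [y[j] + 0.1 * rng.normal(size=(rng.integers(5, 15), 1)) for j in range(N)]
L_yy = gaussian_kernel(y, y)

def shrinkage_cme(x_query, y_query):
    """Evaluate <k_{x*}, C^S_{X|Y} l_{y*}> with the shrinkage CMO of eq. (10).

    Column j of K_bag holds the Monte Carlo CME of bag j at the query points,
    i.e. (1/n_j) * sum_i k(x*, x_j^(i)); only the N x N Gram matrix is inverted.
    """
    K_bag = np.stack([gaussian_kernel(x_query, b).mean(axis=1) for b in bags], axis=1)
    weights = np.linalg.solve(L_yy + N * lam * np.eye(N), gaussian_kernel(y, y_query))
    return K_bag @ weights                             # shape (n_x_queries, n_y_queries)

mu_hat = shrinkage_cme(np.linspace(-3, 3, 5).reshape(-1, 1), np.zeros((1, 1)))
```

The bag-averaged kernel evaluations replace the replication of $y_j$, which is what removes the cubic dependence on the total number of HR points.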
## 4 Deconditioning as regression

In Section 3.2, we obtained a DMO-based estimate for the posterior mean of $f|\tilde{\mathbf{z}}$. As the estimate gets closer to the exact operator, the uncertainty collapses and the Bayesian view meets the frequentist one. It is, however, unclear how the empirical operators converge with finite data. Adopting an alternative perspective, we now demonstrate that the DMO estimate can be obtained as the minimiser of a two-staged vector-valued regression. This frequentist turn enables us to leverage the rich theory of vector-valued regression and to establish, under mild assumptions, a convergence rate on the DMO estimator, together with conditions for minimax optimality in terms of statistical-computational efficiency. In the following, we briefly review the vector-valued regression viewpoint on the CMO and construct an analogous regression problem for the DMO. We refer the reader to [35] for a comprehensive overview of vector-valued RKHS theory. As in Sections 3.1 and 3.2, this section contributes to the general theory of DMEs and we hence put aside the bag notation.

**Stage 1: Regressing the Conditional Mean Operator.** As first introduced by Grünewälder et al. [23] and generalised to infinite-dimensional spaces by Singh et al. [25], estimating $C_{X|Y}^*$ is equivalent to solving a vector-valued kernel ridge regression problem in the hypothesis space of Hilbert-Schmidt operators from $\mathcal{H}_k$ to $\mathcal{H}_\ell$, denoted $\mathrm{HS}(\mathcal{H}_k, \mathcal{H}_\ell)$. Specifically, we may consider the operator-valued kernel defined over $\mathcal{H}_k$ as $\Gamma(f, f') := \langle f, f' \rangle_k\,\mathrm{Id}_{\mathcal{H}_\ell}$. We denote by $\mathcal{H}_\Gamma$ the $\mathcal{H}_\ell$-valued RKHS spanned by $\Gamma$ with norm $\|\cdot\|_\Gamma$, which can be identified with $\mathrm{HS}(\mathcal{H}_k, \mathcal{H}_\ell)$, since $\mathcal{H}_\Gamma = \overline{\mathrm{Span}}\{\Gamma_f h : f \in \mathcal{H}_k, h \in \mathcal{H}_\ell\} = \overline{\mathrm{Span}}\{f \otimes h : f \in \mathcal{H}_k, h \in \mathcal{H}_\ell\} = \mathcal{H}_k \otimes \mathcal{H}_\ell = \mathrm{HS}(\mathcal{H}_k, \mathcal{H}_\ell)$. Singh et al. [25] frame CMO regression as the minimisation of the surrogate discrepancy $\mathcal{E}_c(C) := \mathbb{E}_{X,Y}\big[\|k_X - C^*\ell_Y\|_k^2\big]$, for which they substitute an empirical regularised version restricted to $\mathcal{H}_\Gamma$, given by

$$\hat{\mathcal{E}}_c(C) := \frac{1}{N}\sum_{i=1}^N \|k_{x_i} - C^*\ell_{y_i}\|_k^2 + \lambda\|C\|_\Gamma^2, \qquad C \in \mathcal{H}_\Gamma,\ \lambda > 0. \tag{11}$$

This $\mathcal{H}_k$-valued kernel ridge regression problem admits a closed-form minimiser which shares the same empirical form as the CMO, i.e. $\hat{C}_{X|Y}^* = \arg\min_{C \in \mathcal{H}_\Gamma} \hat{\mathcal{E}}_c(C)$ [23, 25].

**Stage 2: Regressing the Deconditional Mean Operator.** The DMO, on the other hand, is defined as the operator $D_{X|Y} : \mathcal{H}_k \to \mathcal{H}_\ell$ such that $D_{X|Y}^* C_{X|Y}^* f = f$ for all $f \in \mathcal{H}_k$. Since deconditioning corresponds to finding a pseudo-inverse of the CMO, it is natural to consider the reconstruction objective $\mathcal{E}_d(D) := \mathbb{E}_Y\big[\|\ell_Y - DC_{X|Y}\ell_Y\|_\ell^2\big]$. Introducing a novel characterization of the DMO, we propose to minimise this objective in the hypothesis space of Hilbert-Schmidt operators $\mathrm{HS}(\mathcal{H}_k, \mathcal{H}_\ell)$, which identifies with $\mathcal{H}_\Gamma$. As above, we denote by $\hat{C}_{X|Y}$ the empirical CMO learnt in Stage 1, and we substitute the loss with an empirical regularised formulation on $\mathcal{H}_\Gamma$,

$$\hat{\mathcal{E}}_d(D) := \frac{1}{M}\sum_{j=1}^M \|\ell_{\tilde{y}_j} - D\hat{C}_{X|Y}\ell_{\tilde{y}_j}\|_\ell^2 + \epsilon\|D\|_\Gamma^2, \qquad D \in \mathcal{H}_\Gamma,\ \epsilon > 0. \tag{12}$$

**Proposition 4.1 (Empirical DMO as vector-valued regressor).** The minimiser of the empirical reconstruction risk is the empirical DMO, i.e. $\hat{D}_{X|Y} = \arg\min_{D \in \mathcal{H}_\Gamma} \hat{\mathcal{E}}_d(D)$.

Since it requires estimating the CMO first, minimising (12) can be viewed as a two-staged vector-valued regression problem.
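To illustrate Proposition 4.1, the sketch below uses explicit finite-dimensional feature maps, so that operators become plain matrices, and checks numerically that the closed-form ridge minimiser of the Stage 2 objective (12) coincides with the kernel-form empirical DMO of Section 2.2. The feature maps, dimensions and regularisation values are arbitrary choices made for the purpose of the check.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, dk, dl = 50, 40, 6, 4                      # sample sizes, feature dimensions
lam, eps = 1e-2, 1e-2                            # stage-1 / stage-2 regularisation

# Explicit finite-dimensional feature maps, so operators are dl x dk matrices
Wk, Wl = rng.normal(size=(dk, 1)), rng.normal(size=(dl, 1))
phi_k = lambda x: np.cos(Wk * x.T)               # (dk, n) features on X
phi_l = lambda y: np.cos(Wl * y.T)               # (dl, n) features on Y

x, y = rng.normal(size=(N, 1)), rng.normal(size=(N, 1))   # dataset D1
y_tilde = rng.normal(size=(M, 1))                          # covariates of D2

Phi_x, Psi_y, Psi_yt = phi_k(x), phi_l(y), phi_l(y_tilde)
K_xx, L_yy = Phi_x.T @ Phi_x, Psi_y.T @ Psi_y

# Stage 1: empirical CMO (1) as a dk x dl matrix
C_hat = Phi_x @ np.linalg.solve(L_yy + N * lam * np.eye(N), Psi_y.T)

# Kernel-form empirical DMO: Psi_yt (A^T K_xx A + M*eps*I)^{-1} A^T Phi_x^T
A = np.linalg.solve(L_yy + N * lam * np.eye(N), Psi_y.T @ Psi_yt)
D_kernel = Psi_yt @ np.linalg.solve(A.T @ K_xx @ A + M * eps * np.eye(M), A.T @ Phi_x.T)

# Stage 2: ridge regression of targets l_{y_j} onto inputs C_hat l_{y_j}, as in (12)
C_feats = C_hat @ Psi_yt                         # columns are C_hat applied to l_{y_j}
D_ridge = Psi_yt @ C_feats.T @ np.linalg.inv(C_feats @ C_feats.T + M * eps * np.eye(dk))

print(np.allclose(D_kernel, D_ridge))            # True: both are the same minimiser
```

The two expressions agree by the push-through identity, which is exactly the mechanism behind the representer-style closed form used in the proposition.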
**Convergence results.** Following the footsteps of [24, 25], this perspective enables us to state the performance of the DMO estimate in terms of the asymptotic convergence of the objective $\mathcal{E}_d$. As in Caponnetto and De Vito [36], we must restrict the class of probability measures for $\mathbb{P}_{XY}$ and $\mathbb{P}_Y$ to ensure uniform convergence even when $\mathcal{H}_k$ is infinite dimensional. The family of distributions considered is a general class of priors that does not assume parametric forms and is parametrized by two variables: $b > 1$ controls the effective input dimension and $c \in\, ]1, 2]$ controls functional smoothness. Mild regularity assumptions are also placed on the original spaces $\mathcal{X}$, $\mathcal{Y}$, their corresponding RKHSs $\mathcal{H}_k$, $\mathcal{H}_\ell$ and the vector-valued RKHS $\mathcal{H}_\Gamma$. We discuss these assumptions in detail in Appendix D. Importantly, while $\mathcal{H}_k$ can be infinite dimensional, we nonetheless have to assume that the RKHS $\mathcal{H}_\ell$ is finite dimensional; relaxing this assumption is left for further research.

**Theorem 4.2 (Empirical DMO Convergence Rate).** Denote $D_{\mathbb{P}_Y} = \arg\min_{D \in \mathcal{H}_\Gamma} \mathcal{E}_d(D)$ and assume the assumptions stated in Appendix D are satisfied. In particular, let $(b, c)$ and $(0, c')$ be the parameters of the restricted classes of distributions for $\mathbb{P}_Y$ and $\mathbb{P}_{XY}$ respectively, and let $\iota \in\, ]0, 1]$ be the Hölder continuity exponent in $\mathcal{H}_\Gamma$. Then, if we choose $\lambda = N^{-\frac{1}{c'+1}}$ and $N = M^{\frac{a(c'+1)}{\iota(c'-1)}}$ with $a > 0$, we have the following results:

- If $a \le \frac{b(c+1)}{bc+1}$, then $\mathcal{E}_d(\hat{D}_{X|Y}) - \mathcal{E}_d(D_{\mathbb{P}_Y}) = \mathcal{O}\big(M^{-\frac{ac}{c+1}}\big)$ with $\epsilon = M^{-\frac{a}{c+1}}$.
- If $a \ge \frac{b(c+1)}{bc+1}$, then $\mathcal{E}_d(\hat{D}_{X|Y}) - \mathcal{E}_d(D_{\mathbb{P}_Y}) = \mathcal{O}\big(M^{-\frac{bc}{bc+1}}\big)$ with $\epsilon = M^{-\frac{b}{bc+1}}$.

This theorem underlines a trade-off between computational and statistical efficiency with respect to the dataset cardinalities $N = |\mathcal{D}_1|$, $M = |\mathcal{D}_2|$ and the problem difficulty $b, c, c'$. For $a \le \frac{b(c+1)}{bc+1}$, a smaller $a$ means fewer samples from $\mathcal{D}_1$ at fixed $M$ and thus computational savings, but it also hampers convergence, resulting in reduced statistical efficiency. At $a = \frac{b(c+1)}{bc+1} < 2$, the convergence rate is minimax optimal in terms of computational-statistical efficiency, i.e. the optimal rate is attained with the smallest possible $M$. We note that at this optimum $N > M$, which means fewer samples are required from $\mathcal{D}_2$. Choosing $a \ge \frac{b(c+1)}{bc+1}$ does not improve the convergence rate; it only increases the size of $\mathcal{D}_1$ and hence the computational cost it bears.
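As a reading aid for Theorem 4.2, the helper below computes the regularisation and sample-size schedule it prescribes, together with the resulting rate exponent. It is purely an illustration of the theorem's formulas with hypothetical parameter values, and it makes the computational-statistical trade-off explicit.

```python
def dmo_rate_schedule(M, a, b, c, c_prime, iota=1.0):
    """Schedules from Theorem 4.2: returns (lam, eps, N, rate_exponent).

    The excess risk E_d(D_hat) - E_d(D_P_Y) is then O(M ** -rate_exponent).
    """
    N = M ** (a * (c_prime + 1) / (iota * (c_prime - 1)))
    lam = N ** (-1.0 / (c_prime + 1))
    a_star = b * (c + 1) / (b * c + 1)           # computational-statistical optimum
    if a <= a_star:
        eps, rate = M ** (-a / (c + 1)), a * c / (c + 1)
    else:                                        # larger a only inflates |D1|
        eps, rate = M ** (-b / (b * c + 1)), b * c / (b * c + 1)
    return lam, eps, N, rate

# Example: at the optimum a = b(c+1)/(bc+1), the minimax rate O(M^{-bc/(bc+1)})
# is attained while N already exceeds M, i.e. D1 must be the larger dataset.
b, c, c_prime = 2.0, 1.5, 3.0
print(dmo_rate_schedule(M=1_000, a=b * (c + 1) / (b * c + 1), b=b, c=c, c_prime=c_prime))
```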
## 5 Deconditional Downscaling Experiments

We demonstrate and evaluate our CMP-based downscaling approaches on both synthetic experiments and a challenging atmospheric temperature field downscaling problem with unmatched multi-resolution data. We denote the exact CMP deconditional posterior from Section 3 as CMP, the CMP with the efficient shrinkage CMO estimator as S-CMP, and the variational formulation as VARCMP. They are compared against VBAGG [9], which we describe below, and a GP regression baseline (GPR) [14] modified to take bag centroids as inputs. Experiments are implemented in PyTorch [37, 38], all code and datasets are made available at https://github.com/shahineb/deconditional-downscaling, and computational details are provided in Appendix E.

**Variational Aggregate Learning.** VBAGG is introduced by Law et al. [9] as a variational aggregate learning framework to disaggregate exponential family models, with emphasis on the Poisson family. We consider its application to the Gaussian family, which models the relationship between aggregate targets $z_j$ and bag covariates $\{x_j^{(i)}\}_i$ by bag-wise averaging of a GP prior on the function of interest. In fact, the Gaussian VBAGG corresponds exactly to a special case of CMP on matched data, where the bag covariates are simply one-hot encoded indices with kernel $\ell(j, j') = \delta(j, j')$, where $\delta$ is the Kronecker delta. However, VBAGG cannot handle unmatched data, as bag indices do not instill the smoothness that is used for mediation. For a fair analysis, we compare the variational methods VARCMP and VBAGG together, and the exact methods CMP/S-CMP against an exact version of VBAGG, which we implement and refer to as BAGG-GP.

### 5.1 Swiss Roll

The scikit-learn [39] swiss roll manifold sampling function generates a 3D manifold of points $x \in \mathbb{R}^3$ mapped with their position along the manifold $t \in \mathbb{R}$. Our objective is to recover $t$ for each point $x$ by only observing $t$ at an aggregate level. In the first experiment, we compare our model to existing weakly supervised spatial disaggregation methods when all high-resolution covariates are matched with a coarse-level aggregate target. We then withdraw this requirement in a companion experiment.

#### 5.1.1 Direct matching

*Figure 2: Step 1: split space regularly along height. Step 2: group points into height-level bags. Step 3: average point targets into bag-level aggregate targets.*

**Experimental Setup.** Inspired by the experimental setup of Law et al. [9], we regularly split space along height $B - 1$ times, as depicted in Figure 2, and group together manifold points within each height level, hence mixing points with very different positions on the manifold. We obtain bags of samples $\{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^B$, where the $j$th bag contains $n_j$ points $\mathbf{x}_j = \{x_j^{(i)}\}_{i=1}^{n_j}$ and their corresponding targets $\mathbf{t}_j = \{t_j^{(i)}\}_{i=1}^{n_j}$. We then construct bag aggregate targets by taking noisy bag-target averages $z_j := \frac{1}{n_j}\sum_{i=1}^{n_j} t_j^{(i)} + \varepsilon_j$, where $\varepsilon_j \sim \mathcal{N}(0, \sigma^2)$. We thus obtain a matched weakly supervised bag learning dataset $\mathcal{D}' = \{(\mathbf{x}_j, z_j)\}_{j=1}^B$. Since each bag corresponds to a height level, the centre altitude $y_j \in \mathbb{R}$ of each height split is a natural candidate bag-level covariate informing on the relative positions of the bags. We can hence augment the above dataset as $\mathcal{D} = \{(\mathbf{x}_j, y_j, z_j)\}_{j=1}^B$. Using these bag datasets, we now wish to downscale the aggregate targets $z_j$ to recover the unobserved manifold locations $\{\mathbf{t}_j\}_{j=1}^B$ and to be able to query the target at any previously unseen input $x$.

**Models.** We use a zero-mean prior on $f$ and choose Gaussian kernels for $k$ and $\ell$. Inducing point locations are initialized with the K-means++ procedure for VARCMP and VBAGG such that they spread evenly across the manifold. For exact methods, kernel hyperparameters and the noise variance $\sigma^2$ are learnt on $\mathcal{D}$ by optimising the marginal likelihood. For VARCMP, they are learnt jointly with the variational distribution parameters by maximising an evidence lower bound objective. While CMP-based methods can leverage the bag-level covariates $y_j$, baselines are restricted to learning from $\mathcal{D}'$. The Adam optimiser [40] is used in all experiments.

**Results.** We test models against the unobserved groundtruth $\{\mathbf{t}_j\}_{j=1}^B$ by evaluating the root mean square error (RMSE) between the posterior mean and the groundtruth. Table 1 shows that CMP, S-CMP and VARCMP outperform their corresponding counterparts, i.e. BAGG-GP and VBAGG, with statistical significance confirmed by a Wilcoxon signed-rank test in Appendix E. Most notably, this shows that the additional knowledge of bag-level dependence instilled by $\ell$ pays off even in a setting where each bag is matched with an aggregate target.

Table 1: RMSE of the swiss roll experiment for models trained on directly and indirectly matched datasets; scores averaged over 20 seeds with 1 s.d. reported; * indicates our proposed methods.

| Matching | CMP* | S-CMP* | VARCMP* | BAGG-GP | VBAGG | GPR |
|---|---|---|---|---|---|---|
| Direct | 0.33 ± 0.06 | 0.25 ± 0.04 | 0.18 ± 0.04 | 0.60 ± 0.01 | 0.22 ± 0.04 | 0.70 ± 0.05 |
| Indirect | 0.80 ± 0.14 | 1.05 ± 0.04 | 0.87 ± 0.07 | 1.13 ± 0.11 | 1.46 ± 0.34 | 1.04 ± 0.05 |

#### 5.1.2 Indirect matching

**Experimental Setup.** We now impose indirect matching through the mediating variable $y_j$. We randomly select $N = B/2$ bags, which we consider to be the first $N$ without loss of generality, and split $\mathcal{D}$ into $\mathcal{D}_1 = \{(\mathbf{x}_j, y_j)\}_{j=1}^N$ and $\mathcal{D}_2 = \{(\tilde{y}_j, \tilde{z}_j)\}_{j=1}^{B-N} = \{(y_{N+j}, z_{N+j})\}_{j=1}^{B-N}$, such that no pair of covariate bag $\mathbf{x}_j$ and aggregate target $z_j$ is jointly observed.
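A sketch of the data preparation described above, using scikit-learn's swiss roll sampler: points are bagged by height slices, bag targets are noisy averages of the manifold positions, and the bags are then split so that covariate bags and aggregate targets are only linked through the height covariate. Sample sizes, the number of bags and the noise level are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

rng = np.random.default_rng(4)
n_samples, B, sigma = 5000, 50, 0.05
X, t = make_swiss_roll(n_samples=n_samples, random_state=0)   # X in R^3, t = manifold position

# Steps 1-2: split the vertical axis into B height levels and bag points accordingly
edges = np.linspace(X[:, 1].min(), X[:, 1].max(), B + 1)
bag_idx = np.clip(np.digitize(X[:, 1], edges) - 1, 0, B - 1)
bags_x = [X[bag_idx == j] for j in range(B)]
bags_t = [t[bag_idx == j] for j in range(B)]

# Step 3: noisy bag-level aggregate targets and bag-level covariates (centre altitudes)
z = np.array([bt.mean() + sigma * rng.normal() for bt in bags_t])
y = 0.5 * (edges[:-1] + edges[1:])

# Indirect matching: half of the bags keep only (x_j, y_j), the other half only (y_j, z_j)
perm = rng.permutation(B)
idx1, idx2 = perm[: B // 2], perm[B // 2:]
D1 = [(bags_x[j], y[j]) for j in idx1]          # bags of HR covariates with mediating y
D2 = [(y[j], z[j]) for j in idx2]               # aggregate targets with mediating y only
```

No bag in `D1` has a corresponding target in `D2`, which is exactly the unmatched regime that the mediated deconditioning setup is designed to handle.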
**Models.** CMP-based methods are naturally able to learn in this setting and are trained by independently drawing samples from $\mathcal{D}_1$ and $\mathcal{D}_2$. Baseline methods, however, require bags of covariates to be matched with an aggregate bag target. To remedy this, we place a separate prior $g \sim \mathcal{GP}(0, \ell)$ and fit the regression model $\tilde{z}_j = g(\tilde{y}_j) + \varepsilon_j$ on $\mathcal{D}_2$. We then use the predictive posterior mean to augment the first dataset as $\mathcal{D}_1' = \{(\mathbf{x}_j, \mathbb{E}[g(y_j)|\mathcal{D}_2])\}_{j=1}^N$. This dataset can then be used to train BAGG-GP, VBAGG and GPR.

**Results.** For comparison, we use the same evaluation as in the direct matching experiment. Table 1 underlines an anticipated performance drop for all models, but we observe that BAGG-GP and VBAGG suffer most from the mediated matching of the dataset, while CMP and VARCMP report the best scores by a substantial margin. This highlights how using a separate prior on $g$ to mediate $\mathcal{D}_1$ and $\mathcal{D}_2$ turns out to be suboptimal, in contrast to using the prior naturally implied by the CMP. While S-CMP is more computationally efficient than CMP, we observe a relative drop in its performance.

### 5.2 Mediated downscaling of atmospheric temperature

Given the large diversity of sources and formats of remote sensing and model data, expecting directly matched observations is often unrealistic [41]. For example, two distinct satellite products will often provide low- and high-resolution imagery that can be matched neither spatially nor temporally [2, 3, 4, 5]. Climate simulations [19, 20, 21], on the other hand, provide a comprehensive coarse-resolution coverage of meteorological variables that can serve as a mediating dataset. For the purpose of demonstration, we create an experimental setup inspired by this problem using Coupled Model Intercomparison Project Phase 6 (CMIP6) [19] simulation data. This grants us access to groundtruth high-resolution covariates to facilitate model evaluation.

**Experimental Setup.** We collect monthly mean 2D atmospheric field simulations from CMIP6 data [42, 43] for the following variables: air temperature at cloud top (T), mean cloud top pressure (P), total cloud cover (TCC) and cloud albedo ($\alpha$). First, we collocate TCC and $\alpha$ onto a HR latitude-longitude grid of size $360 \times 720$ to obtain fine-grained fields $(\mathrm{latitude}_{\mathrm{HR}}, \mathrm{longitude}_{\mathrm{HR}}, \mathrm{altitude}_{\mathrm{HR}}, \mathrm{TCC}_{\mathrm{HR}}, \alpha_{\mathrm{HR}})$, augmented with a static HR surface altitude field. Then we collocate P and T onto a LR grid of size $21 \times 42$ to obtain coarse-grained fields $(\mathrm{latitude}_{\mathrm{LR}}, \mathrm{longitude}_{\mathrm{LR}}, \mathrm{P}_{\mathrm{LR}}, \mathrm{T}_{\mathrm{LR}})$. We denote by $B$ the number of low-resolution pixels. Our objective is to disaggregate $\mathrm{T}_{\mathrm{LR}}$ to the granularity of the HR fields. We assimilate the $j$th coarse temperature pixel to an aggregate target $z_j := \mathrm{T}_{\mathrm{LR},j}$ corresponding to bag $j$. Each bag includes HR covariates $\mathbf{x}_j = \{x_j^{(i)}\}_{i=1}^{n_j} := \{(\mathrm{latitude}_{\mathrm{HR}}, \mathrm{longitude}_{\mathrm{HR}}, \mathrm{altitude}_{\mathrm{HR}}, \mathrm{TCC}_{\mathrm{HR}}, \alpha_{\mathrm{HR}})_j^{(i)}\}_{i=1}^{n_j}$. To emulate unmatched observations, we randomly select $N = B/2$ of the bags $\{\mathbf{x}_j\}_{j=1}^N$ and keep the opposite half of the LR observations $\{z_{N+j}\}_{j=1}^{B-N}$, such that no single aggregate bag target corresponds to one of the available bags. Finally, we choose the pressure field $\mathrm{P}_{\mathrm{LR}}$ as the mediating variable. We hence compose a third-party low-resolution field of bag-level covariates $y_j := (\mathrm{latitude}_{\mathrm{LR},j}, \mathrm{longitude}_{\mathrm{LR},j}, \mathrm{P}_{\mathrm{LR},j})$, which can separately be matched with both sets above to obtain datasets $\mathcal{D}_1 = \{(\mathbf{x}_j, y_j)\}_{j=1}^N$ and $\mathcal{D}_2 = \{(\tilde{y}_j, \tilde{z}_j)\}_{j=1}^{B-N} = \{(y_{N+j}, z_{N+j})\}_{j=1}^{B-N}$.

*Figure 3: Left: high-resolution atmospheric covariates used for prediction; Centre-left: observed low-resolution temperature field, grey pixels are unobserved; Centre: unobserved high-resolution groundtruth temperature field; Centre-right: VARCMP deconditional posterior mean; Right: 95% confidence region size on predictions; temperature values are in Kelvin.*

**Models Setup.** We only consider variational methods to scale to the large number of pixels. VARCMP is naturally able to learn from indirectly matched data. We use a Matérn-1.5 kernel for the rough spatial covariates (latitude, longitude) and a Gaussian kernel for the atmospheric covariates (P, TCC, $\alpha$) and surface altitude; $k$ and $\ell$ are both taken as sums of Matérn and Gaussian kernels, and their hyperparameters are learnt along with the noise variance during training. A high-resolution noise term is also introduced, with details provided in Appendix E. Inducing point locations are uniformly initialized across the HR grid. We replace GPR with an inducing-point variational counterpart, VARGPR [15]. Since baseline methods require a matched dataset, we proceed as in the unmatched swiss roll experiment: we fit a GP regression model $g$ with kernel $\ell$ on $\mathcal{D}_2$ and then use its predictive posterior mean to obtain pseudo-targets for the bags of HR covariates from $\mathcal{D}_1$.
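A sketch of the kind of kernel composition described in the models setup, using GPyTorch (the GP library used in the experiments): a Matérn-1.5 component over the spatial coordinates plus a Gaussian (RBF) component over the remaining covariates, summed together. The column layout and the scale/lengthscale initialisation are assumptions made for illustration; the exact configuration used in the paper is given in its Appendix E and released code.

```python
import torch
import gpytorch

# Assumed HR covariate layout: columns 0-1 = (lat, lon), 2-4 = (altitude, TCC, albedo)
spatial = gpytorch.kernels.MaternKernel(nu=1.5, ard_num_dims=2, active_dims=(0, 1))
atmospheric = gpytorch.kernels.RBFKernel(ard_num_dims=3, active_dims=(2, 3, 4))
k = gpytorch.kernels.ScaleKernel(spatial) + gpytorch.kernels.ScaleKernel(atmospheric)

# Bag-level kernel on assumed layout (lat_LR, lon_LR, P_LR): same sum-of-kernels structure
ell_spatial = gpytorch.kernels.MaternKernel(nu=1.5, ard_num_dims=2, active_dims=(0, 1))
ell_pressure = gpytorch.kernels.RBFKernel(active_dims=(2,))
ell = gpytorch.kernels.ScaleKernel(ell_spatial) + gpytorch.kernels.ScaleKernel(ell_pressure)

x_hr = torch.randn(8, 5)   # dummy HR covariates
K_lazy = k(x_hr)           # lazy 8 x 8 kernel; .to_dense() / .evaluate() gives a tensor
```

Summing a rough spatial kernel with a smooth atmospheric kernel lets lengthscales be learnt per covariate group, which mirrors the description above.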
Table 2: Downscaling similarity scores of the posterior mean against the groundtruth high-resolution cloud top temperature field; averaged over 10 seeds with 1 s.d. reported; ↓: lower is better; ↑: higher is better.

| Model | RMSE ↓ | MAE ↓ | Corr. ↑ | SSIM ↑ |
|---|---|---|---|---|
| VARGPR | 8.02 ± 0.28 | 5.55 ± 0.17 | 0.831 ± 0.012 | 0.212 ± 0.011 |
| VBAGG | 8.25 ± 0.15 | 5.82 ± 0.11 | 0.821 ± 0.006 | 0.182 ± 0.004 |
| VARCMP | 7.40 ± 0.25 | 5.34 ± 0.22 | 0.848 ± 0.011 | 0.212 ± 0.013 |

**Results.** Performance is evaluated by comparing the downscaled deconditional posterior mean against the original high-resolution field $\mathrm{T}_{\mathrm{HR}}$ available in the CMIP6 data [43], which, we emphasise, is never observed during training. We use a random Fourier features [44] approximation of the kernel $k$ to scale kernel evaluations to the HR covariate grid at test time. As reported in Table 2, VARCMP substantially outperforms both baselines, with statistical significance provided in Appendix E. Figure 3 shows the image reconstructed with VARCMP; plots for the other methods are included in Appendix E. The model resolves statistical patterns from the HR covariates into the coarse-resolution temperature pixels, thereby reconstructing a faithful HR version of the temperature field.

## 6 Discussion

We introduced a scalable Bayesian solution to the mediated statistical downscaling problem, which handles unmatched multi-resolution data. The proposed approach combines Gaussian processes with the RKHS framework of deconditioning and recovers previous approaches as special cases. We provided convergence rates for the associated deconditioning operator. Finally, we demonstrated its advantages over spatial disaggregation baselines on synthetic data and on a challenging atmospheric field downscaling problem. In future work, exploring theoretical guarantees for the computationally efficient shrinkage formulation in a multi-resolution setting and relaxing the finite dimensionality assumption for the convergence rate will have fruitful practical and theoretical implications. Further directions also open up in quantifying uncertainty over the deconditional posterior, since it is computed using empirical estimates of the CMP covariance; this may be problematic if the mediating variable undergoes covariate shift between the two datasets.

**Acknowledgments.** SLC is supported by the EPSRC and MRC through the OxWaSP CDT programme EP/L016710/1. SB is supported by the EU's Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No 860100. DS is supported in part by Tencent AI Lab and in part by the Alan Turing Institute (EP/N510129/1).

## References

[1] Kelvin Hsu and Fabio Ramos. Bayesian deconditional kernel mean embeddings. In
Proceedings of Machine Learning Research, PMLR, 2019.

[2] L. A. Remer, Y. J. Kaufman, D. Tanré, S. Mattoo, D. A. Chu, J. V. Martins, R.-R. Li, C. Ichoku, R. C. Levy, R. G. Kleidman, T. F. Eck, E. Vermote, and B. N. Holben. The MODIS aerosol algorithm, products, and validation. Journal of the Atmospheric Sciences, 2005.

[3] S. Platnick, M. D. King, S. A. Ackerman, W. P. Menzel, B. A. Baum, J. C. Riedi, and R. A. Frey. The MODIS cloud products: algorithms and examples from Terra. IEEE Transactions on Geoscience and Remote Sensing, 2003.

[4] Graeme L. Stephens, Deborah G. Vane, Ronald J. Boain, Gerald G. Mace, Kenneth Sassen, Zhien Wang, Anthony J. Illingworth, Ewan J. O'Connor, William B. Rossow, Stephen L. Durden, et al. The CloudSat mission and the A-Train: A new dimension of space-based observations of clouds and precipitation. Bulletin of the American Meteorological Society, 2002.

[5] William L. Barnes, Thomas S. Pagano, and Vincent V. Salomonson. Prelaunch characteristics of the Moderate Resolution Imaging Spectroradiometer (MODIS) on EOS-AM1. IEEE Transactions on Geoscience and Remote Sensing, 1998.

[6] Zbyněk Malenovský, Helmut Rott, Josef Cihlar, Michael E. Schaepman, Glenda García-Santos, Richard Fernandes, and Michael Berger. Sentinels for science: Potential of Sentinel-1, -2, and -3 missions for scientific observations of ocean, cryosphere, and land. Remote Sensing of Environment, 2012.

[7] Edward J. Knight and Geir Kvaran. Landsat-8 operational land imager design. Remote Sensing of Environment, 2014.

[8] Yivan Zhang, Nontawat Charoenphakdee, Zhenguo Wu, and Masashi Sugiyama. Learning from aggregate observations. In Advances in Neural Information Processing Systems, 2020.

[9] Leon Ho Chung Law, Dino Sejdinovic, Ewan Cameron, Tim C. D. Lucas, Seth Flaxman, Katherine Battle, and Kenji Fukumizu. Variational learning on aggregate outputs with Gaussian processes. In Advances in Neural Information Processing Systems, 2018.

[10] Oliver Hamelijnck, Theodoros Damoulas, Kangrui Wang, and Mark A. Girolami. Multi-resolution multi-task Gaussian processes. In Advances in Neural Information Processing Systems, 2019.

[11] Fariba Yousefi, Michael Thomas Smith, and Mauricio A. Álvarez. Multi-task learning for aggregated data using Gaussian processes. In Advances in Neural Information Processing Systems, 2019.

[12] Yusuke Tanaka, Toshiyuki Tanaka, Tomoharu Iwata, Takeshi Kurashima, Maya Okawa, Yasunori Akagi, and Hiroyuki Toda. Spatially aggregated Gaussian processes with multivariate areal outputs. In Advances in Neural Information Processing Systems, 2019.

[13] Ville Tanskanen, Krista Longi, and Arto Klami. Non-linearities in Gaussian processes with integral observations. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2020.

[14] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2005.

[15] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.

[16] Anna Vaughan, Will Tebbutt, J. Scott Hosking, and Richard E. Turner. Convolutional conditional neural processes for local climate downscaling. Geoscientific Model Development Discussions, pages 1-25, 2021.

[17] Brian Groenke, Luke Madaus, and Claire Monteleoni. ClimAlign: Unsupervised statistical downscaling of climate variables via normalizing flows. In Proceedings of the 10th International Conference on Climate Informatics, pages 60-66, 2020.

[18] Thomas Vandal, Evan Kodra, Sangram Ganguly, Andrew Michaelis, Ramakrishna Nemani, and Auroop R. Ganguly.
DeepSD: Generating high resolution climate change projections through single image super-resolution. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1663-1672, 2017.

[19] Veronika Eyring, Sandrine Bony, Gerald A. Meehl, Catherine A. Senior, Bjorn Stevens, Ronald J. Stouffer, and Karl E. Taylor. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geoscientific Model Development, 2016.

[20] Gregory M. Flato. Earth system models: an overview. WIREs Climate Change, 2011.

[21] Marko Scholze, J. Icarus Allen, William J. Collins, Sarah E. Cornell, Chris Huntingford, Manoj M. Joshi, Jason A. Lowe, Robin S. Smith, and Oliver Wild. Earth system models. Cambridge University Press, 2012.

[22] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016.

[23] Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[24] Zoltán Szabó, Bharath K. Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning theory for distribution regression. Journal of Machine Learning Research, 2016.

[25] Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems, 2019.

[26] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 2013.

[27] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 2004.

[28] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, 2008.

[29] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 2012.

[30] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.

[31] F. M. Larkin. Gaussian measure in Hilbert space and applications in numerical analysis. The Rocky Mountain Journal of Mathematics, 1972.

[32] François-Xavier Briol, Chris J. Oates, Mark Girolami, Michael A. Osborne, Dino Sejdinovic, et al. Probabilistic integration: A role in statistical computation? Statistical Science, 2019.

[33] Tim G. J. Rudner, Dino Sejdinovic, and Yarin Gal. Inter-domain deep Gaussian processes. In International Conference on Machine Learning, 2020.

[34] Krikamol Muandet, Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, and Bernhard Schölkopf. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 2016.

[35] Vern Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces. Cambridge University Press, 2016.

[36] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 2007.
[37] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 2019.

[38] Jacob Gardner, Geoff Pleiss, Kilian Q. Weinberger, David Bindel, and Andrew G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.

[39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2011.

[40] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, 2015.

[41] D. Watson-Parris, N. Schutgens, N. Cook, Z. Kipling, P. Kershaw, E. Gryspeerdt, B. Lawrence, and P. Stier. Community Intercomparison Suite (CIS) v1.4.0: a tool for intercomparing models and observations. Geoscientific Model Development, 2016.

[42] Malcolm Roberts. MOHC HadGEM3-GC31-HM model output prepared for CMIP6 HighResMIP hist-1950. Earth System Grid Federation, 2018, version v20180730. doi: 10.22033/ESGF/CMIP6.6040.

[43] Aurore Voldoire. CNRM-CERFACS CNRM-CM6-1-HR model output prepared for CMIP6 HighResMIP hist-1950. Earth System Grid Federation, 2019, version v20190221. doi: 10.22033/ESGF/CMIP6.4040.

[44] Ali Rahimi, Benjamin Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.

[45] Olav Kallenberg. Foundations of Modern Probability. Springer, 2002.

[46] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 2005.

[47] Ho Chung Law, Peilin Zhao, Leung Sing Chan, Junzhou Huang, and Dino Sejdinovic. Hyperparameter learning via distributional transfer. In Advances in Neural Information Processing Systems, 2019.

[48] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Publishing Company, Incorporated, 2008.