# PSD Representations for Effective Probability Models

Alessandro Rudi
Inria, École normale supérieure, CNRS, PSL Research University, Paris, France
alessandro.rudi@inria.fr

Carlo Ciliberto
Department of Computer Science, University College London, London, UK
c.ciliberto@ucl.ac.uk

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

**Abstract.** Finding a good way to model probability densities is key to probabilistic inference. An ideal model should be able to concisely approximate any probability while being compatible with two main operations: multiplication of two models (product rule) and marginalization with respect to a subset of the random variables (sum rule). In this work, we show that a recently proposed class of positive semidefinite (PSD) models for non-negative functions is particularly suited to this end. In particular, we characterize both the approximation and the generalization capabilities of PSD models, showing that they enjoy strong theoretical guarantees. Moreover, we show that both the sum and the product rule can be performed efficiently and in closed form via matrix operations, enjoying the same versatility of mixture models. Our results open the way to applications of PSD models to density estimation, decision theory and inference.

## 1 Introduction

Modeling probability distributions is a key task for many applications in machine learning [17, 5]. To this end, several strategies have been proposed in the literature, such as adopting mixture models (e.g. Gaussian mixtures) [5], exponential models [32], implicit generative models [11, 13] or kernel (conditional) mean embeddings [16]. An ideal probabilistic model should have two main features: i) efficiently perform key operations for probabilistic inference, such as the sum rule (i.e. marginalization) and the product rule [5] and, ii) concisely approximate a large class of probabilities. Finding models that satisfy these two conditions is challenging, and current methods tend to tackle only one of the two. Exponential and implicit generative models typically have strong approximation properties (see e.g. [32, 29]) but cannot easily perform operations such as marginalization. On the contrary, mixture models are designed to efficiently integrate and multiply probabilities, but tend to require a large number of components to approximate complicated distributions.

In principle, mixture models would offer an appealing strategy to model probability densities, since they allow for efficient computations when performing key operations such as the sum and product rule. However, these computational advantages come at the expense of expressiveness: even though mixture models are universal approximators (namely, they can approximate arbitrarily well any probability density), they require a significant number $n$ of observations and of components to do so. Indeed, it is known that models that are non-negative mixtures of non-negative components lead to learning rates that suffer from the curse of dimensionality. For example, when approximating probabilities on $\mathbb{R}^d$, kernel density estimation (KDE) [21] with non-negative components has rates as slow as $n^{-2/(4+d)}$ (see, e.g., [38, page 100]), and cannot improve with the regularity (smoothness) of the target probability (see e.g. [9, Thm. 16.1] for impossibility results in the one-dimensional case).
In the past decades, this limitation has been overcome by removing either the non-negativity of the weights of the mixture (leading to RBF networks [27, 35]), or the non-negativity of the components used in the mixture (leading to KDE with oscillating kernels [35]), or both. On the positive side, this allows achieving a learning rate of $n^{-\beta/(2\beta+d)}$ for $\beta$-times differentiable densities on $\mathbb{R}^d$. Note that such a rate is minimax optimal [10] and overcomes the curse of dimensionality when $\beta \geq d$. Additionally, the resulting model is very concise, since only $m = O(n^{d/(2\beta+d)})$ centers are necessary to achieve the optimal rate [36]. However, on the negative side, the resulting model is not a probability, since it may attain negative values, and so it cannot be used where a proper probability is required.

In this paper, we show that the positive semidefinite (PSD) models recently proposed in [14] offer a way out of this dichotomy. By construction, PSD models generalize mixture models by allowing also for negative weights, while still guaranteeing that the resulting function is non-negative everywhere. Here, we prove that they get the best of both worlds: expressivity (optimal learning and approximation rates with concise models) and flexibility (exact and efficient sum and product rules).

**Contributions.** The main contributions of this paper are: i) showing that PSD models can perform exact sum and product rules in terms of efficient matrix operations (Sec. 2.1); ii) characterizing the approximation and learning properties of PSD models with respect to a large family of probability densities (Sec. 3); iii) providing a compression method to control the number of components of a PSD model (introduced in Sec. 2.2 and analyzed in Sec. 3.3); iv) discussing a number of possible applications of PSD models (Sec. 4).

| | Non-negative | Sum Rule | Product Rule | Concise Approximation | Optimal Learning | Efficient Sampling |
|---|---|---|---|---|---|---|
| Linear Models | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Mean Embeddings | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Mixture Models | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| Exponential Models | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| PSD Models | ✓ (see [14]) | ✓ (Prop. 1) | ✓ (Prop. 2) | ✓ (Thm. 6) | ✓ (Thm. 7) | ✓ (see [15]) |

*Table 1: Summary of the main desirable properties for a probability model.*

**Summary of the results.** Table 1 summarizes the desirable properties that a probability model should have, and which of them are satisfied by state-of-the-art estimators. Among these we have: *non-negativity*, the model should be point-wise non-negative; *sum and product rule*, the model should allow for efficient computation of such operations between probabilities; *concise approximation*, a model with the optimal number of components $m = O(\varepsilon^{-d/\beta})$ is enough to approximate, with error $\varepsilon$, a non-negative function in a wide family of smooth ($\beta$-times differentiable) functions [36]; *optimal learning*, the model achieves optimal learning rates of order $n^{-\beta/(2\beta+d)}$ when learning the unknown target probability from i.i.d. samples [35]; *efficient sampling*, the model allows extracting i.i.d. samples without incurring the curse of dimensionality in terms of computational complexity. We note that for PSD models this last property has been recently investigated in [15]. We consider under the umbrella of *linear models* all models corresponding to a mixture $\sum_{j=1}^m w_j f_j(x)$ of functions $f_j : X \to \mathbb{R}$, where either the weights $w_j$, or the components $f_j$, or both, are allowed to take negative values. We denote by *mixture models* the models defined as above where $w_j \geq 0$ and $f_j \geq 0$ for all $j = 1, \dots, m$.
By *mean embeddings*, we denote the kernel mean embedding estimator framework from [30] (see also [16]). With *exponential models* we refer to models of the form $\exp\big(\sum_{j=1}^m w_j f_j(x)\big)$ with $w_j \in \mathbb{R}$, $f_j : X \to \mathbb{R}$ [32]. With *PSD models* we refer to the framework introduced in [14] and studied in this paper. We note that this work contributes to showing that the latter is the only probability model satisfying all the requirements (we recall that non-negativity and efficient sampling have been shown respectively in [14] and [15]).

**Notation.** We denote by $\mathbb{R}^d_{++}$ the space of vectors in $\mathbb{R}^d$ with positive entries, by $\mathbb{R}^{n\times d}$ the space of $n \times d$ matrices, and by $\mathbb{S}^n_+ = \mathbb{S}_+(\mathbb{R}^n)$ the space of positive semidefinite $n \times n$ matrices. Given a vector $\eta \in \mathbb{R}^d$, we denote by $\mathrm{diag}(\eta) \in \mathbb{R}^{d\times d}$ the diagonal matrix associated to $\eta$. We denote by $A \circ B$ and $A \otimes B$ respectively the entry-wise and the Kronecker product between two matrices $A$ and $B$. We denote by $\|A\|_F$, $\|A\|$, $\det(A)$, $\mathrm{vec}(A)$ and $A^\top$ respectively the Frobenius norm, the operator norm (i.e. the maximum singular value), the determinant, the (column-wise) vectorization and the (conjugate) transpose of $A$. With some abuse of notation, where clear from context we write element-wise products and divisions of vectors $u, v \in \mathbb{R}^d$ as $uv$, $u/v$. Given two matrices $X \in \mathbb{R}^{n\times d_1}$, $Y \in \mathbb{R}^{n\times d_2}$ with the same number of rows, we denote by $[X, Y] \in \mathbb{R}^{n\times(d_1+d_2)}$ their row-wise concatenation. The term $1_n \in \mathbb{R}^n$ denotes the vector with all entries equal to 1.

## 2 PSD Models

Following [14], in this work we consider the family of positive semidefinite (PSD) models, namely non-negative functions parametrized by a feature map $\phi : X \to \mathcal{H}$ from an input space $X$ to a suitable feature space $\mathcal{H}$ (a separable Hilbert space, e.g. $\mathbb{R}^q$) and a linear operator $M \in \mathbb{S}_+(\mathcal{H})$, of the form
$$f(x;\, M, \phi) = \phi(x)^\top M\, \phi(x). \qquad (1)$$
PSD models offer a general way to parametrize non-negative functions (since $M$ is positive semidefinite, $f(x; M, \phi) \geq 0$ for any $x \in X$) and enjoy several additional appealing properties discussed in the following. In this work, we will focus on a special family of models of the form (1) to parametrize probability densities over $X = \mathbb{R}^d$. In particular, we will consider the case where: i) $\phi = \phi_\eta : \mathbb{R}^d \to \mathcal{H}$ is the feature map associated to the Gaussian kernel [26]
$$k_\eta(x, x') = \phi_\eta(x)^\top \phi_\eta(x') = e^{-(x - x')^\top \mathrm{diag}(\eta)\, (x - x')},$$
with $\eta \in \mathbb{R}^d_{++}$, and ii) the operator $M$ lives in the span of $\phi_\eta(x_1), \dots, \phi_\eta(x_n)$ for a given set of points $(x_i)_{i=1}^n$, namely there exists $A \in \mathbb{S}^n_+$ such that $M = \sum_{ij} A_{ij}\, \phi_\eta(x_i)\, \phi_\eta(x_j)^\top$. We define a Gaussian PSD model by specializing the definition in (1) as
$$f(x;\, A, X, \eta) = \sum_{i,j=1}^n A_{ij}\, k_\eta(x_i, x)\, k_\eta(x_j, x), \qquad \forall\, x \in \mathbb{R}^d, \qquad (2)$$
in terms of the coefficient matrix $A \in \mathbb{S}^n_+$, the base point matrix $X \in \mathbb{R}^{n\times d}$, whose $i$-th row corresponds to the point $x_i$ for each $i = 1, \dots, n$, and the kernel parameter $\eta$. In the following, given two base point matrices $X \in \mathbb{R}^{n\times d}$ and $X' \in \mathbb{R}^{m\times d}$, we denote by $K_{X,X',\eta} \in \mathbb{R}^{n\times m}$ the kernel matrix with entries $(K_{X,X',\eta})_{ij} = k_\eta(x_i, x'_j)$, where $x_i, x'_j$ are the $i$-th and $j$-th rows of $X, X'$ respectively. When clear from context, in the following we will refer to Gaussian PSD models simply as PSD models.

**Remark 1** (PSD models generalize mixture models). Mixture models (i.e. mixtures of Gaussian distributions) are a special case of PSD models. Let $A = \mathrm{diag}(a)$ be a diagonal matrix of $n$ positive weights $a \in \mathbb{R}^n_{++}$. We have $f(x;\, A, X, \eta/2) = \sum_{i=1}^n A_{ii}\, k_{\eta/2}(x_i, x)^2 = \sum_{i=1}^n a_i\, k_\eta(x_i, x)$.
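To make the definition concrete, the following is a minimal NumPy sketch (not code from the paper; all function names are ours) that evaluates the Gaussian PSD model (2) and checks its non-negativity on a two-point model with a negative off-diagonal coefficient (cf. Remark 2 below).

```python
import numpy as np

def gauss_kernel(X, Z, eta):
    """k_eta(x, z) = exp(-(x - z)^T diag(eta) (x - z)) for all pairs of rows of X and Z."""
    diff = X[:, None, :] - Z[None, :, :]                    # shape (n, m, d)
    return np.exp(-np.einsum("nmd,d,nmd->nm", diff, eta, diff))

def psd_eval(x, A, X, eta):
    """Gaussian PSD model (2): f(x; A, X, eta) = k(x)^T A k(x), with k_i(x) = k_eta(x_i, x)."""
    k = gauss_kernel(X, x[None, :], eta)                    # shape (n, 1)
    return (k.T @ A @ k).item()

# A two-point model in d = 1 with a negative off-diagonal entry.
X = np.array([[0.0], [2.0]])
A = np.array([[1.0, -0.5], [-0.5, 0.25]])                   # eigenvalues {1.25, 0}: A is PSD
eta = np.array([1.0])

vals = [psd_eval(np.array([x]), A, X, eta) for x in np.linspace(-2.0, 4.0, 601)]
assert min(vals) >= -1e-12   # non-negative everywhere (up to rounding), despite A_12 < 0
```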
**Remark 2** (PSD models allow negative weights). From (2), we immediately see that PSD models generalize mixture models by allowing also for negative weights: e.g., $f(\,\cdot\,;\, A, X, \eta)$ with $A = \big(\begin{smallmatrix} 1 & -1/2 \\ -1/2 & 1/4 \end{smallmatrix}\big) \in \mathbb{S}^2_+$, $\eta = 1$, $X = (x_1; x_2)$ and $x_1 = 0$, $x_2 = 2$, corresponding to $f(x;\, A, X, \eta) = e^{-2x^2} + \tfrac14\, e^{-2(x-2)^2} - \tfrac{1}{e^2}\, e^{-2(x-1)^2}$, i.e. a mixture of Gaussians with also negative weights.

### 2.1 Operations with PSD models

In Sec. 3 we will show that PSD models can approximate a wide class of probability densities, significantly outperforming mixture models. Here we show that this improvement does not come at the expense of computations. In particular, we show that PSD models enjoy the same flexibility as mixture models: i) they are closed with respect to key operations such as marginalization and multiplication, and ii) these operations can be performed efficiently in terms of matrix sums/products. The derivation of the results reported in the following is provided in Appendix F; they follow from well-known properties of the Gaussian function.

**Evaluation.** Evaluating a PSD model at a point $x_0 \in X$ corresponds to $f(x = x_0;\, A, X, \eta) = K_{X,x_0,\eta}^\top\, A\, K_{X,x_0,\eta}$. Moreover, partially evaluating a PSD model in one variable yields
$$f(x, y = y_0;\, A, [X, Y], (\eta_1, \eta_2)) = f(x;\, B, X, \eta_1), \quad \text{with} \quad B = A \circ \big(K_{Y,y_0,\eta_2}\, K_{Y,y_0,\eta_2}^\top\big). \qquad (3)$$
Note that $f(x;\, B, X, \eta_1)$ is still a PSD model, since $B$ is positive semidefinite.

**Sum rule (marginalization and integration).** The integral of a PSD model can be computed as
$$\int f(x;\, A, X, \eta)\, dx = c_{2\eta}\, \mathrm{Tr}\big(A\, K_{X,X,\eta/2}\big), \quad \text{with} \quad c_{2\eta} = \int k_{2\eta}(0, x)\, dx = (\pi/2)^{d/2}\, \det(\mathrm{diag}(\eta))^{-1/2}. \qquad (4)$$
This is particularly useful to model probability densities with PSD models: let $Z = \int f(x;\, A, X, \eta)\, dx$, then the function $f(x;\, A/Z, X, \eta) = \tfrac1Z f(x;\, A, X, \eta)$ is a probability density. Integrating only one variable of a PSD model, we obtain the sum rule.

**Proposition 1** (Sum rule — marginalization). Let $X \in \mathbb{R}^{n\times d}$, $Y \in \mathbb{R}^{n\times d'}$, $A \in \mathbb{S}_+(\mathbb{R}^n)$ and $\eta \in \mathbb{R}^d_{++}$, $\eta' \in \mathbb{R}^{d'}_{++}$. Then the following integral is a PSD model:
$$\int f(x, y;\, A, [X, Y], (\eta, \eta'))\, dx = f(y;\, B, Y, \eta'), \quad \text{with} \quad B = c_{2\eta}\; A \circ K_{X,X,\eta/2}. \qquad (5)$$

The result above shows that we can efficiently marginalize a PSD model with respect to one variable by means of an entry-wise multiplication between two matrices.

**Remark 3** (Integration and marginalization on the hypercube). The integrals in (4) and (5) can also be computed when the domain is a hypercube $H = \prod_{t=1}^d [a_t, b_t]$ rather than the entire space $\mathbb{R}^d$. This leads to a closed form where the matrix $K_{X,X,\eta/2}$ is replaced by a suitable $K_{X,X,\eta/2,H}$ that can be computed with the same number of operations (the full form of this matrix is reported in Appendix F).
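Continuing the sketch above (kept self-contained, with our own helper names; not code from the paper), the integral (4) and the marginalization of Prop. 1 reduce to a trace and a Hadamard product.

```python
import numpy as np

def gauss_kernel(X, Z, eta):
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum("nmd,d,nmd->nm", diff, eta, diff))

def integrate_psd(A, X, eta):
    """Eq. (4): integral over R^d equals c_{2 eta} * Tr(A K_{X,X,eta/2})."""
    c = (np.pi / 2) ** (X.shape[1] / 2) / np.sqrt(np.prod(eta))
    return c * np.trace(A @ gauss_kernel(X, X, eta / 2))

def marginalize_x(A, X, eta_x):
    """Prop. 1: integrate out the x-block of f(x, y; A, [X, Y], (eta_x, eta_y));
    returns the coefficient matrix B of the marginal PSD model f(y; B, Y, eta_y)."""
    c = (np.pi / 2) ** (X.shape[1] / 2) / np.sqrt(np.prod(eta_x))
    return c * (A * gauss_kernel(X, X, eta_x / 2))          # entry-wise (Hadamard) product

# Joint model on (x, y) in R^2 with 3 base points.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(3, 1)), rng.normal(size=(3, 1))
L = rng.normal(size=(3, 3)); A = L @ L.T                    # any PSD coefficient matrix
eta_x, eta_y = np.array([1.5]), np.array([0.7])

B = marginalize_x(A, X, eta_x)                              # marginal over x is f(y; B, Y, eta_y)

# Sanity check: the integral of the marginal equals the integral of the joint (eq. (4) twice).
Z_joint = integrate_psd(A, np.hstack([X, Y]), np.concatenate([eta_x, eta_y]))
Z_marg  = integrate_psd(B, Y, eta_y)
assert np.isclose(Z_joint, Z_marg)
```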
**Product rule (multiplication).** Multiplying two probabilities is key to several applications in probabilistic inference [5]. The family of PSD models is closed with respect to this operation.

**Proposition 2** (Multiplication). Let $X \in \mathbb{R}^{n\times d_1}$, $Y \in \mathbb{R}^{n\times d_2}$, $Y' \in \mathbb{R}^{m\times d_2}$, $Z \in \mathbb{R}^{m\times d_3}$, $A \in \mathbb{S}^n_+$, $B \in \mathbb{S}^m_+$ and $\eta_1 \in \mathbb{R}^{d_1}_{++}$, $\eta_2, \eta'_2 \in \mathbb{R}^{d_2}_{++}$, $\eta_3 \in \mathbb{R}^{d_3}_{++}$. Then
$$f(x, y;\, A, [X, Y], (\eta_1, \eta_2))\; f(y, z;\, B, [Y', Z], (\eta'_2, \eta_3)) = f(x, y, z;\, C, W, \tilde\eta) \qquad (6)$$
is a PSD model, where
$$C = (A \otimes B) \circ \big(\mathrm{vec}(K_{Y,Y',\tilde\eta_2})\, \mathrm{vec}(K_{Y,Y',\tilde\eta_2})^\top\big), \quad \text{with} \quad \tilde\eta_2 = \frac{\eta_2\, \eta'_2}{\eta_2 + \eta'_2},$$
the base point matrix is $W = \big[\,X \otimes 1_m,\ \tfrac{\eta_2}{\eta_2+\eta'_2}\, (Y \otimes 1_m) + \tfrac{\eta'_2}{\eta_2+\eta'_2}\, (1_n \otimes Y'),\ 1_n \otimes Z\,\big]$ (the vector fractions acting entry-wise on the columns of the corresponding block), and $\tilde\eta = (\eta_1,\, \eta_2 + \eta'_2,\, \eta_3)$.

We note that, despite the heavy notation, multiplying two PSD models is performed via simple operations, such as tensor and entry-wise products between matrices. In particular, $X \otimes 1_m \in \mathbb{R}^{nm\times d_1}$ and $1_n \otimes Z \in \mathbb{R}^{nm\times d_3}$ correspond respectively to the $nm \times d_1$ matrix containing $m$ copies of each row of $X$ and the $nm \times d_3$ matrix containing $n$ copies of $Z$. Finally, $\tfrac{\eta_2}{\eta_2+\eta'_2}\, Y \otimes 1_m + \tfrac{\eta'_2}{\eta_2+\eta'_2}\, 1_n \otimes Y'$ is the $nm \times d_2$ base point matrix containing all the weighted sums $\big(\tfrac{\eta_2}{\eta_2+\eta'_2}\, y_i + \tfrac{\eta'_2}{\eta_2+\eta'_2}\, y'_j\big)_{i,j=1}^{n,m}$ of points from $Y$ and $Y'$.

**Reduction.** As observed above, when performing operations with PSD models, the resulting base point matrix might be of the form $X \otimes 1_m$ (e.g. if we couple the multiplication in (6) with marginalization, as in a Markov transition; see the corollary below). In these cases we can reduce the PSD model to only $n$ base points (rather than $nm$), as follows:
$$f(x;\, A, X \otimes 1_m, \eta) = f(x;\, B, X, \eta), \quad \text{with} \quad B = (I_n \otimes 1_m^\top)\, A\, (I_n \otimes 1_m), \qquad (7)$$
where $A \in \mathbb{S}^{nm}_+$ and $I_n \in \mathbb{R}^{n\times n}$ is the $n \times n$ identity matrix. The reduction operation is useful to prevent the dimensionality of the PSD model from growing unnecessarily. This is for instance the case of a Markov transition.

**Corollary 3** (Markov transition). Let $X \in \mathbb{R}^{n\times d_1}$, $Y \in \mathbb{R}^{n\times d_2}$, $Y' \in \mathbb{R}^{m\times d_2}$, $A \in \mathbb{S}^n_+$, $B \in \mathbb{S}^m_+$ and $\eta_1 \in \mathbb{R}^{d_1}_{++}$, $\eta_2, \eta'_2 \in \mathbb{R}^{d_2}_{++}$. Then
$$\int f(x, y;\, A, [X, Y], (\eta_1, \eta_2))\, f(y;\, B, Y', \eta'_2)\, dy = f(x;\, C, X, \eta_1), \qquad (8)$$
with $C \in \mathbb{S}^n_+$ obtained by applying, in order, Prop. 2, Prop. 1 and the reduction (7).

We remark that the result of a Markov transition retains the base point matrix $X$ and the kernel parameter $\eta_1$ of the transition model. This is thanks to the reduction operation (7), which prevents the resulting matrix $C$ from being $nm \times nm$. This fact is particularly useful in applications that require multiple Markov transitions, such as hidden Markov models (see also Sec. 4).

### 2.2 Compression of a PSD model

From Prop. 2 we note that the multiplication operation can rapidly yield a large number of base points and thus incur a high computational complexity in some settings. It is therefore useful to have a method that reduces the number of base points while retaining essentially the same model. To this purpose, here we propose a dimensionality reduction strategy. In particular, given a set of points $\tilde x_1, \dots, \tilde x_m \in \mathbb{R}^d$, we leverage the representation of the PSD model in terms of reproducing kernel Hilbert spaces [3] and use powerful sketching techniques, such as the Nyström projection (see, e.g., [40]), to project the PSD model onto a new PSD model based only on the new points. Given $A \in \mathbb{S}^n_+$, $X \in \mathbb{R}^{n\times d}$ and $\eta \in \mathbb{R}^d_{++}$, let $\tilde X \in \mathbb{R}^{m\times d}$ be the base point matrix whose $j$-th row corresponds to the point $\tilde x_j$. The compression of $f(\,\cdot\,;\, A, X, \eta)$ corresponds to
$$f(x;\, \tilde A, \tilde X, \eta), \quad \text{with} \quad \tilde A = B A B^\top \in \mathbb{S}^m_+, \qquad B = K_{\tilde X, \tilde X, \eta}^{-1}\, K_{\tilde X, X, \eta} \in \mathbb{R}^{m\times n}, \qquad (9)$$
which is still a PSD model, since $\tilde A = B A B^\top \in \mathbb{S}^m_+$. We note that, even with a rather simple strategy to choose the new base points $\tilde x_1, \dots, \tilde x_m$, such as uniform sampling, compression is an effective tool to reduce the computational complexity of costly operations. In particular, in Sec. 3.3 we show that a compressed model with only $O(t\, \mathrm{polylog}(1/\varepsilon))$ centers (instead of $t^2$) can $\varepsilon$-approximate the product of two PSD models with $t$ points each.
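A possible NumPy implementation of the compression (9) is sketched below (our code, not the paper's; a small ridge is added to $K_{\tilde X,\tilde X,\eta}$ for numerical stability, under the assumption that it is well conditioned).

```python
import numpy as np

def gauss_kernel(X, Z, eta):
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum("nmd,d,nmd->nm", diff, eta, diff))

def compress_psd(A, X, eta, X_new, reg=1e-10):
    """Eq. (9): Nystrom projection of f(.; A, X, eta) onto the span of the new base
    points X_new; returns the compressed coefficient matrix A_tilde = B A B^T."""
    K_nn = gauss_kernel(X_new, X_new, eta) + reg * np.eye(len(X_new))
    B = np.linalg.solve(K_nn, gauss_kernel(X_new, X, eta))   # B = K_{X~,X~}^{-1} K_{X~,X}
    return B @ A @ B.T                                        # still PSD

def psd_eval(x, A, X, eta):
    k = gauss_kernel(X, x[None, :], eta)
    return (k.T @ A @ k).item()

# A model with 200 base points, compressed onto 40 points sampled uniformly in [-1, 1]^2.
rng = np.random.default_rng(0)
d, n, m = 2, 200, 40
X = rng.uniform(-1, 1, size=(n, d))
L = rng.normal(size=(n, n)) / n; A = L @ L.T
eta = np.full(d, 3.0)
X_new = rng.uniform(-1, 1, size=(m, d))
A_new = compress_psd(A, X, eta, X_new)

# Pointwise comparison at a few test points (close for m large enough, cf. Thm. 8).
ts = rng.uniform(-1, 1, (5, d))
print([abs(psd_eval(t, A, X, eta) - psd_eval(t, A_new, X_new, eta)) for t in ts])
```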
## 3 Representation power of PSD models

In this section we study the theoretical properties of Gaussian PSD models. We start by showing that they admit concise approximations of the target density (in the sense discussed in the introduction and in Table 1). We then proceed to the setting in which we aim to learn an unknown probability from i.i.d. samples. We conclude the section by characterizing the approximation properties of the compression operation introduced in Sec. 2.2.

### 3.1 Approximation properties of Gaussian PSD models

We start the section by recalling that Gaussian PSD models are universal approximators for probability densities. In particular, the following result restates [14, Thm. 2] for the case of probabilities.

**Proposition 4** (Universal consistency, Thm. 2 in [14]). The Gaussian PSD family is a universal approximator for probabilities that admit a density.

The result above is not surprising, since Gaussian PSD models generalize classical Gaussian mixtures (see Remark 1), which are known to be universal [5]. We now introduce a mild assumption that will enable us to complement Prop. 4 with approximation and learning results. In the rest of the section we assume that $X = (-1, 1)^d$ (or, more generally, an open bounded subset of $\mathbb{R}^d$ with Lipschitz boundary). Here $L^\infty(X)$ and $L^2(X)$ denote respectively the space of essentially bounded and of square-integrable functions over $X$, while $W_2^\beta(X)$ denotes the Sobolev space of functions whose weak derivatives up to order $\beta$ are square-integrable on $X$ (see [1] or Appendix A for more details).

**Assumption 1.** Let $\beta > 0$, $q \in \mathbb{N}$. There exist $f_1, \dots, f_q \in W_2^\beta(X) \cap L^\infty(X)$ such that the density $p : X \to \mathbb{R}$ satisfies
$$p(x) = \sum_{j=1}^q f_j(x)^2, \qquad \forall\, x \in X. \qquad (10)$$

The assumption above is quite general and is satisfied by a wide family of probabilities, as discussed in the following proposition, whose proof is reported in Appendix D.1.

**Proposition 5** (Generality of Assumption 1). Assumption 1 is satisfied by: (a) any probability density $p \in W_2^\beta(X) \cap L^\infty(X)$ that is strictly positive on $[-1,1]^d$; (b) any exponential model $p(x) = e^{-v(x)}$ with $v \in W_2^\beta(X) \cap L^\infty(X)$; (c) any mixture of Gaussians or, more generally, of exponential models as in (b); (d) any $p$ that is $\beta+2$-times differentiable on $[-1,1]^d$, with a finite set of zeros, all in $(-1,1)^d$, and with positive definite Hessian at each zero, e.g. $p(x) \propto x^2 e^{-x^2}$. Moreover, if $p$ is $\beta$-times differentiable over $[-1,1]^d$, then it belongs to $W_2^\beta(X) \cap L^\infty(X)$.

We note that in principle the class $C_{\beta,d} = W_2^\beta(X) \cap L^\infty(X)$ is larger than the Nikolskii class usually considered in density estimation [10] when $\beta \in \mathbb{N}$, but Assumption 1 imposes a restriction on it. However, Prop. 5 shows that the restriction imposed by Assumption 1 is quite mild, since the resulting space includes all probabilities that are $\beta$-times differentiable and strictly positive (it also contains probabilities that are not strictly positive but have some zeros, see [24]). We can now proceed to the main result of this work, which characterizes the approximation capabilities of PSD models.

**Theorem 6** (Conciseness of PSD approximation). Let $p$ satisfy Assumption 1 and let $\varepsilon > 0$. There exists a Gaussian PSD model of dimension $m \in \mathbb{N}$, i.e., $\hat p_m(x) = f(x;\, A_m, X_m, \eta_m)$ with $A_m \in \mathbb{S}^m_+$, $X_m \in \mathbb{R}^{m\times d}$ and $\eta_m \in \mathbb{R}^d_{++}$, such that
$$\|p - \hat p_m\|_{L^2(X)} \leq \varepsilon, \quad \text{with} \quad m = O\big(\varepsilon^{-d/\beta}\, (\log\tfrac1\varepsilon)^{d/2}\big). \qquad (11)$$

The proof of Thm. 6 is reported in Appendix D.3. According to the result, the number of base points needed for a PSD model to approximate a density up to precision $\varepsilon$ depends on its smoothness (the smoother, the fewer points) and matches the bound $m = \varepsilon^{-d/\beta}$ that is optimal for function interpolation [18], corresponding to models allowing for negative weights, and is also optimal for convex combinations of oscillating kernels [36].

### 3.2 Learning a density with PSD models

In this section we study the capability of PSD models to estimate a density from $n$ samples. Let $X = (-1,1)^d$ and let $p$ be a probability density on $X$. Denote by $x_1, \dots, x_n$ the samples, independently and identically distributed according to $p$, with $n \in \mathbb{N}$. We consider a Gaussian PSD estimator $\hat p_{n,m} = f(x;\, \hat A, \tilde X, \eta)$ that is built on top of $m$ additional points $\tilde x_1, \dots, \tilde x_m$, sampled independently and uniformly at random in $X$.
In particular, $\eta \in \mathbb{R}^d_{++}$, $\tilde X \in \mathbb{R}^{m\times d}$ is the base point matrix whose $j$-th row corresponds to the point $\tilde x_j$, and $\hat A \in \mathbb{S}^m_+$ is trained as follows:
$$\hat A = \operatorname*{argmin}_{A \in \mathbb{S}^m_+}\ \int_X f(x;\, A, \tilde X, \eta)^2\, dx \;-\; \frac2n \sum_{i=1}^n f(x_i;\, A, \tilde X, \eta) \;+\; \lambda\, \|K^{1/2} A K^{1/2}\|_F^2, \qquad (12)$$
where $K = K_{\tilde X, \tilde X, \eta}$. Note that the functional consists of two parts. The first two terms are an empirical version of $\|f - p\|_{L^2(X)}^2$, up to a constant independent of $f$ (and so not affecting the optimization problem), since the $x_i$ are identically distributed according to $p$ and hence $\frac1n \sum_{i=1}^n f(x_i) \approx \int f(x)\, p(x)\, dx$. The last term is a regularizer and corresponds to $\|K^{1/2} A K^{1/2}\|_F^2 = \mathrm{Tr}(AKAK)$, i.e. the squared Frobenius norm of $M = \sum_{i,j=1}^m A_{ij}\, \phi_\eta(\tilde x_i)\, \phi_\eta(\tilde x_j)^\top$. The problem in (12) is a quadratic problem with a positive semidefinite constraint and can be solved using techniques such as Newton's method [24] or first-order dual methods [14]. We are now ready to state our result.

**Theorem 7.** Let $n, m \in \mathbb{N}$, $\lambda > 0$, $\eta \in \mathbb{R}^d_{++}$ and let $p$ be a density satisfying Assumption 1. With the definitions above, let $\hat p_{n,m}$ be the model $\hat p_{n,m}(x) = f(x;\, \hat A, \tilde X, \eta)$, with $\hat A$ the minimizer of (12). Let $\eta = n^{\frac{2}{2\beta+d}}\, 1_d$ and $\lambda = n^{-\frac{2\beta+2d}{2\beta+d}}$. When $m \geq C'\, n^{\frac{d}{2\beta+d}}\, (\log n)^d\, \log\!\big(C''\, n (\log n)/\delta\big)$, the following holds with probability at least $1 - \delta$:
$$\|p - \hat p_{n,m}\|_{L^2(X)} \leq C\, n^{-\frac{\beta}{2\beta+d}}\, (\log n)^{d/2}, \qquad (13)$$
where the constant $C$ depends only on $\beta$, $d$ and $p$, and the constants $C', C''$ depend only on $\beta, d$.

The proof of Thm. 7 is reported in Appendix E.2. The theorem guarantees that, under Assumption 1, Gaussian PSD models achieve the rate $O(n^{-\beta/(2\beta+d)})$, which is optimal for $\beta$-times differentiable densities, while admitting a concise representation. Indeed, the estimator needs a number $m = O(n^{d/(2\beta+d)})$ of base points, matching the optimal rate in [36]. When $\beta \geq d$, a model with $m = O(n^{1/3})$ centers achieves optimal learning rates.

### 3.3 The effect of compression

We have seen in the previous section that Gaussian PSD models achieve optimal learning rates with concise models. However, we have also seen in Sec. 2.1 that multiplying two PSD models with $m$ centers each yields a PSD model with $m^2$ centers. Here we study the effect of compression, showing that it is possible to obtain an $\varepsilon$-approximation of the product via a compressed model with $O(m\, \mathrm{polylog}(1/\varepsilon))$ centers. In the following theorem we analyze the error, in terms of the $L^\infty$ distance on the domain $[-1,1]^d$, induced by compression when the new points are taken independently and uniformly at random from the same domain. Given $A \in \mathbb{S}^n_+$, $X \in \mathbb{R}^{n\times d}$ and $\eta \in \mathbb{R}^d_{++}$, we want to study the compressibility of the PSD model $p(x) = f(x;\, A, X, \eta)$. Let $\tilde X \in \mathbb{R}^{m\times d}$ be the base point matrix whose $j$-th row corresponds to the point $\tilde x_j$, with $\tilde x_1, \dots, \tilde x_m$ sampled independently and uniformly at random from $[-1,1]^d$. Denote by $p_m(x)$ the PSD model $p_m(x) = f(x;\, \tilde A, \tilde X, \eta)$, where $\tilde A$ is the compression of $A$ via (9). We have the following theorem.

**Theorem 8** (Compression of Gaussian PSD models). Let $\delta \in (0, 1]$, $\varepsilon > 0$ and $\eta_+ = \max(1, \max_{i=1,\dots,d} \eta_i)$. When $m$ satisfies
$$m \;\geq\; C\, \eta_+^{d/2}\ \mathrm{polylog}\Big(\frac{\|A\|\, n}{\varepsilon\, \delta}\Big), \qquad (14)$$
for a constant $C$ depending only on $d$, then the following holds with probability at least $1 - \delta$:
$$|p(x) - p_m(x)| \;\leq\; \varepsilon^2 + \varepsilon\, \sqrt{p(x)}, \qquad \forall\, x \in [-1, 1]^d. \qquad (15)$$

The proof of the theorem above is in Appendix C.1. To understand its relevance, let $\hat p_1$ be a PSD model trained via (12) on $n$ points sampled from $p_1$ and $\hat p_2$ be trained on $n$ points sampled from $p_2$, where both $p_1, p_2$ satisfy Assumption 1 for the same $\beta$ and $m, \lambda, \eta$ are chosen as in Thm. 7, in particular $m = n^{d/(2\beta+d)}$ and $\eta = \eta_+ 1_d$ with $\eta_+ = n^{2/(2\beta+d)}$. Consider the model $\hat p = \hat p_1\, \hat p_2$. By construction $\hat p$ has $m^2 = n^{2d/(2\beta+d)}$ centers, since it is the pointwise product of $\hat p_1, \hat p_2$ (see Prop. 2), and it approximates $p_1 p_2$ with error $\varepsilon = n^{-\beta/(2\beta+d)}\, \mathrm{polylog}(n)$, since both $\hat p_1, \hat p_2$ are $\varepsilon$-approximations of $p_1, p_2$. Instead, by compressing $\hat p$, we obtain an estimator $\bar p$ that, according to Thm. 8, achieves error $\varepsilon$ with a number of centers
$$m' = O\big(\eta_+^{d/2}\, \mathrm{polylog}(1/\varepsilon)\big) = O\big(n^{d/(2\beta+d)}\, \mathrm{polylog}(1/\varepsilon)\big) = O\big(m\, \mathrm{polylog}(1/\varepsilon)\big). \qquad (16)$$
Then $\bar p$ approximates $p_1 p_2$ at the optimal rate $n^{-\beta/(2\beta+d)}$, but with a number of centers $m'$ that is only $O(m\, \mathrm{polylog}(n))$ instead of $m^2$. This means that $\bar p$ is essentially as good as if we had learned it from $n$ samples taken directly from $p_1 p_2$. This renders compression a suitable method to reduce the computational complexity of costly inference operations such as the product rule.
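To make the discussion concrete, the sketch below (our code, not the paper's) implements the special case of Prop. 2 in which the two models share all their variables — exactly the pointwise product $\hat p_1 \hat p_2$ discussed above — and then compresses the resulting $n^2$-center product with (9). The row-major flattening plays the role of $\mathrm{vec}$ so that it matches the index ordering of `np.kron`.

```python
import numpy as np

def gauss_kernel(X, Z, eta):
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum("nmd,d,nmd->nm", diff, eta, diff))

def psd_product(A1, X1, e1, A2, X2, e2):
    """Prop. 2 restricted to a single shared variable:
    f(x; A1, X1, e1) * f(x; A2, X2, e2) = f(x; C, W, e1 + e2)."""
    n, m = len(X1), len(X2)
    K = gauss_kernel(X1, X2, e1 * e2 / (e1 + e2))
    v = K.reshape(-1, 1)                                      # vec(K), matches np.kron ordering
    C = np.kron(A1, A2) * (v @ v.T)                           # entry-wise product, still PSD
    W = (X1[:, None, :] * e1 + X2[None, :, :] * e2) / (e1 + e2)
    return C, W.reshape(n * m, -1), e1 + e2

def compress_psd(A, X, eta, X_new, reg=1e-10):
    """Eq. (9): Nystrom compression onto the base points X_new."""
    B = np.linalg.solve(gauss_kernel(X_new, X_new, eta) + reg * np.eye(len(X_new)),
                        gauss_kernel(X_new, X, eta))
    return B @ A @ B.T

def psd_eval(x, A, X, eta):
    k = gauss_kernel(X, x[None, :], eta)
    return (k.T @ A @ k).item()

# Two PSD models on [-1, 1] with 30 centers each; their product has 900 centers.
rng = np.random.default_rng(0)
d, n = 1, 30
X1, X2 = rng.uniform(-1, 1, (n, d)), rng.uniform(-1, 1, (n, d))
L1 = rng.normal(size=(n, n)) / n; A1 = L1 @ L1.T
L2 = rng.normal(size=(n, n)) / n; A2 = L2 @ L2.T
eta = np.full(d, 5.0)

C, W, e_prod = psd_product(A1, X1, eta, A2, X2, eta)          # 900 centers
X_small = rng.uniform(-1, 1, (60, d))                         # compress onto 60 centers
C_small = compress_psd(C, W, e_prod, X_small)

x0 = np.array([0.3])
exact = psd_eval(x0, A1, X1, eta) * psd_eval(x0, A2, X2, eta)
assert np.isclose(exact, psd_eval(x0, C, W, e_prod))          # the product rule is exact
print(exact, psd_eval(x0, C_small, X_small, e_prod))          # compressed value: close
```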
## 4 Applications

PSD models are a strong candidate in a variety of probabilistic settings. On the one hand, they are computationally amenable to performing key operations such as the sum and product rules, similarly to mixture models (Sec. 2.1). On the other hand, they are remarkably flexible and can approximate/learn (concisely) a wide family of target probability densities (Sec. 3). Building on these properties, in this section we consider different possible applications of PSD models in practice.

### 4.1 PSD Models for Decision Theory

Decision theory problems (see e.g. [5] and references therein) can be formulated as the minimization of
$$L(\theta) = \mathbb{E}_{x\sim p}\, \ell(\theta, x), \qquad (17)$$
where $\ell$ is a loss function, $\Theta$ is the space of target parameters (decisions) and $p$ is the underlying data distribution. When we can sample directly from $p$ — e.g. in supervised or unsupervised learning settings — we can apply methods such as stochastic gradient descent to efficiently solve (17). However, in many applications, sampling from $p$ is challenging or computationally unfeasible. This is for instance the case when $p$ has been obtained via inference (e.g. it is the $t$-th estimate in a hidden Markov model, see Sec. 4.3) or when it is fully known but has a highly complex form (e.g. the dynamics of a physical system). In contexts where sampling cannot be performed efficiently, it is advisable to consider alternative approaches. Here we propose a strategy to tackle (17) when $p$ can be modeled (or well approximated) by a PSD model. Our method hinges on the following result.

**Proposition 9.** Let $p(x) = f(x;\, A, X, \eta)$ with $X \in \mathbb{R}^{n\times d}$, $A \in \mathbb{S}^n_{++}$. Let $g : \mathbb{R}^d \to \mathbb{R}$ and define $c_{g,\sigma}(z) = \int g(x)\, e^{-\|x - z\|_\sigma^2}\, dx$ for any $z \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d_{++}$, where $\|u\|_\sigma^2 = u^\top \mathrm{diag}(\sigma)\, u$. Then
$$\mathbb{E}_{x\sim p}\, g(x) = \mathrm{Tr}\big((A \circ K_{X,X,\eta/2})\, G\big), \quad \text{with} \quad G_{ij} = c_{g,2\eta}\Big(\frac{x_i + x_j}{2}\Big). \qquad (18)$$

Thanks to Prop. 9 we can readily compute several quantities related to a PSD model, such as its mean $\mathbb{E}_p[x] = X^\top b$ with $b = c_{2\eta}\, (A \circ K_{X,X,\eta/2})\, 1_n$, its covariance, or its characteristic function (see Appendix F for the explicit formulas and derivations). However, the result above is particularly useful to tackle the minimization in (17). In particular, since $\nabla_\theta L(\theta) = \mathbb{E}_{x\sim p}\, \nabla_\theta \ell(\theta, x)$, we can use Prop. 9 to directly compute the gradient of the objective function: it is sufficient to know how to evaluate (or approximate) the integral $c_{\nabla_\theta \ell,\, \sigma}(z) = \int \nabla_\theta \ell(\theta, x)\, e^{-\|x - z\|_\sigma^2}\, dx$ for any $\theta \in \Theta$ and $z \in \mathbb{R}^d$. We can then use first-order optimization methods, such as gradient descent, to efficiently solve (17). Remarkably, this approach works well also when we approximate $p$ with a PSD model $\hat p$: if $\ell$ is convex, since $\hat p$ is non-negative, the resulting $\hat L(\theta) = \mathbb{E}_{x\sim \hat p}\, \ell(\theta, x)$ is still a convex functional (see also the discussion on structured prediction in Sec. 4.2). This is not the case if we use more general estimators of $p$ that do not preserve non-negativity.
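As an illustration of Prop. 9 (our code, not the paper's), the sketch below computes the mean of a one-dimensional PSD model: for $g(x) = x$ the integral $c_{g,2\eta}(z)$ equals $z\, c_{2\eta}$ in closed form, and the result is compared with a brute-force grid computation.

```python
import numpy as np

def gauss_kernel(X, Z, eta):
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum("nmd,d,nmd->nm", diff, eta, diff))

def c_const(eta):
    """c_{2 eta} = integral of k_{2 eta}(0, x) dx = (pi/2)^{d/2} det(diag(eta))^{-1/2}."""
    return (np.pi / 2) ** (len(eta) / 2) / np.sqrt(np.prod(eta))

def psd_mean(A, X, eta):
    """Mean of the density proportional to f(.; A, X, eta), via Prop. 9 with g(x) = x,
    for which c_{g,2eta}(z) = z * c_{2eta}: E_p[x] = c_{2eta} X^T (A o K_{X,X,eta/2}) 1 / Z."""
    M = A * gauss_kernel(X, X, eta / 2)
    Z = c_const(eta) * M.sum()                        # normalization constant, eq. (4)
    return c_const(eta) * X.T @ M.sum(axis=1) / Z

# A 1-D PSD model, compared against brute-force numerical integration.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (10, 1))
L = rng.normal(size=(10, 10)); A = L @ L.T
eta = np.array([4.0])

xs = np.linspace(-6, 6, 20001)[:, None]
k = gauss_kernel(X, xs, eta)
p_vals = np.einsum("ig,ij,jg->g", k, A, k)            # unnormalized density on the grid
mean_numeric = float((xs[:, 0] * p_vals).sum() / p_vals.sum())
print(psd_mean(A, X, eta), mean_numeric)              # should agree to high precision
```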
### 4.2 PSD Models for Estimating Conditional Probabilities

In supervised learning settings, one is typically interested in solving decision problems of the form $\min_{\theta\in\Theta} \mathbb{E}_{(x,y)\sim p}\, \ell(h_\theta(x), y)$, where $p$ is a probability over the joint input-output space $X \times Y$ and $h_\theta : X \to Y$ is a function parametrized by $\theta$. It is well known [see, e.g., 34] that the ideal solution of this problem is the $\theta^*$ such that, for any $x \in X$, $h_{\theta^*}(x) = \operatorname{argmin}_{z\in Y} \mathbb{E}_{y\sim p(\cdot|x)}\, \ell(z, y)$ is the minimizer, with respect to $z \in Y$, of the conditional expectation of $\ell(z, y)$ given $x$. This leads to target functions that capture specific properties of $p$, such as moments. For instance, when $\ell$ is the squared loss, $h_{\theta^*}(x) = \mathbb{E}_{y\sim p(\cdot|x)}\, y$ corresponds to the conditional expectation of $y$ given $x$, while for the absolute value loss, $h_{\theta^*}$ recovers the conditional median of $p$.

In several applications, associating an input $x$ to a single output quantity $h_\theta(x)$ is not necessarily ideal. For instance, when $p(y|x)$ is multi-modal, estimating the mean or the median might not yield useful predictions for the given task. Moreover, estimators of the form $h_\theta$ require access to the full input $x$ to return a prediction, and therefore cannot be used when some features are missing (e.g. due to data corruption). In these contexts, an alternative viable strategy is to directly model the conditional probability. When using PSD models, conditional estimation can be performed in two steps, by first modeling the joint distribution $p(y, x) = f(y, x;\, A, [Y, X], (\eta, \eta'))$ (e.g. by learning it as suggested in Sec. 3) and then using the operations in Sec. 2.1 to condition it with respect to $x_0 \in X$ as
$$p(y\,|\,x_0) = \frac{p(y, x_0)}{p(x_0)} = f(y;\, B, Y, \eta), \quad \text{with} \quad B = \frac{\tilde A}{c_{2\eta}\, \mathrm{Tr}\big(\tilde A\, K_{Y,Y,\eta/2}\big)}, \quad \tilde A = A \circ \big(K_{X,x_0,\eta'}\, K_{X,x_0,\eta'}^\top\big). \qquad (19)$$
In case of data corruption, it is sufficient to first marginalize $p(y, x)$ over the missing variables and then apply (19). Below we discuss a few applications of the conditional estimator.

**Conditional Expectation.** Conditional mean embeddings [31] are a well-established tool to efficiently compute the conditional expectation $\mathbb{E}_{y\sim p(\cdot|x_0)}\, g(y)$ of a function $g : Y \to \mathbb{R}$. However, although they enjoy good approximation properties [16], they do not guarantee that the resulting estimator takes only non-negative values. In contrast, when $p$ is a PSD model (or a PSD approximation), we can apply Prop. 9 to $p(\cdot|x_0)$ in (19) and evaluate the conditional expectation of any $g$ for which we know how to compute (or approximate) the integral $c_{g,\sigma}(z) = \int g(y)\, e^{-\|y - z\|_\sigma^2}\, dy$. In particular, we have $\mathbb{E}_{y\sim p(\cdot|x_0)}\, g(y) = \mathrm{Tr}\big((B \circ K_{Y,Y,\eta/2})\, G\big)$, with $B$ as in (19) and $G$ the matrix with entries $G_{ij} = c_{g,2\eta}\big(\tfrac{y_i + y_j}{2}\big)$. Remarkably, differently from conditional mean embedding estimators, this strategy allows us to compute the conditional expectation also of functions $g$ that do not belong to the RKHS $\mathcal{H}$ associated to $k_\eta$.
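A minimal sketch of the conditioning step (19) — our code, combining the partial evaluation (3) with the normalization (4):

```python
import numpy as np

def gauss_kernel(X, Z, eta):
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum("nmd,d,nmd->nm", diff, eta, diff))

def condition_on_x(A, Y, X, eta_y, eta_x, x0):
    """Eq. (19): condition the joint PSD model f(y, x; A, [Y, X], (eta_y, eta_x)) on x = x0.
    Returns B such that p(y | x0) = f(y; B, Y, eta_y)."""
    kx = gauss_kernel(X, x0[None, :], eta_x)                  # partial evaluation, eq. (3)
    A_tilde = A * (kx @ kx.T)
    c = (np.pi / 2) ** (len(eta_y) / 2) / np.sqrt(np.prod(eta_y))     # c_{2 eta_y}
    Z = c * np.trace(A_tilde @ gauss_kernel(Y, Y, eta_y / 2))         # p(x0), by eq. (4)
    return A_tilde / Z

# Joint model p(y, x) on R^2, conditioned at x0 = 0.2.
rng = np.random.default_rng(2)
Y, X = rng.uniform(-1, 1, (15, 1)), rng.uniform(-1, 1, (15, 1))
L = rng.normal(size=(15, 15)); A = L @ L.T
eta_y, eta_x = np.array([3.0]), np.array([2.0])
x0 = np.array([0.2])

B = condition_on_x(A, Y, X, eta_y, eta_x, x0)

# Check on a grid: p(y | x0) should integrate to 1.
ys = np.linspace(-4, 4, 4001)[:, None]
ky = gauss_kernel(Y, ys, eta_y)
cond = np.einsum("ig,ij,jg->g", ky, B, ky)
print(cond.sum() * (ys[1, 0] - ys[0, 0]))                     # ~ 1.0
```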
**Structured Prediction.** Structured prediction identifies supervised learning problems where the output space $Y$ has a complex structure, so that it is challenging to find good parametrizations for $h_\theta : X \to Y$ [4, 19]. In [7], a strategy was proposed to tackle these settings by first learning an approximation $\hat\ell(z, x) \approx \mathbb{E}_{y\sim p(\cdot|x)}\, \ell(z, y)$ and then modeling $h_\theta(x) = \operatorname{argmin}_{z\in Y} \hat\ell(z, x)$. However, the resulting function $\hat\ell(\,\cdot\,, x)$ is not guaranteed to be convex, even when $\ell$ is convex. In contrast, by combining the conditional PSD estimator in (19) with the reasoning in Sec. 4.1, we have a strategy that overcomes this issue: when $p(\cdot|x)$ is a PSD model approximating $p$, we can compute the gradient $\mathbb{E}_{y\sim p(\cdot|x)}\, \nabla_z \ell(z, y)$ as mentioned in Sec. 4.1, using Prop. 9. Moreover, if $\ell(\,\cdot\,, y)$ is convex, the term $\mathbb{E}_{y\sim p(\cdot|x)}\, \ell(\,\cdot\,, y)$ is also convex, and we can use methods such as gradient descent to find $h(x)$ exactly.

**Mode Estimation.** When the output distribution $p(y|x)$ is multimodal, having access to an explicit form of the conditional density can be useful to estimate its modes. This problem is typically non-convex; yet, when the output $y$ belongs to a low-dimensional space (e.g. in classification or scalar-valued regression settings), efficient approximations exist (e.g. bisection).

### 4.3 Inference on Hidden Markov Models

We consider the problem of performing inference on hidden Markov models (HMM) using PSD models. Let $(x_t)_{t\in\mathbb{N}}$ and $(y_t)_{t\in\mathbb{N}}$ denote two sequences of states and observations, respectively. For each $t \geq 1$, we denote $x_{0:t} = x_0, \dots, x_t$ and $y_{1:t} = y_1, \dots, y_t$, and we assume that
$$p(x_t\,|\,x_{0:t-1}, y_{1:t-1}) = p(x_t\,|\,x_{t-1}) = \tau(x_t, x_{t-1}) \quad \text{and} \quad p(y_t\,|\,x_{0:t}, y_{1:t-1}) = p(y_t\,|\,x_t) = \omega(y_t, x_t),$$
with $\tau : X \times X \to \mathbb{R}_+$ and $\omega : Y \times X \to \mathbb{R}_+$ respectively the transition and observation functions. Our goal is to infer the distribution of the possible states $x_t$ at time $t$, given all the observations $y_{1:t}$ and a probability $p(x_0)$ on the possible initial states. We focus on this goal for simplicity, but other forms of inference are possible (e.g. estimating $x_{t+m}$, namely $m$ steps into the future or the past). We assume that the transition and observation functions can be well approximated by PSD models $\hat\tau$ and $\hat\omega$ (e.g. learned from data, or known a priori from the problem's dynamics). Then, given a PSD model estimate $\hat p(x_0)$ of the initial state distribution $p(x_0)$, we can recursively define the sequence of estimates
$$\hat p(x_t\,|\,y_{1:t}) \;=\; \frac{\hat\omega(y_t, x_t)\, \int \hat\tau(x_t, x_{t-1})\, \hat p(x_{t-1}\,|\,y_{1:t-1})\, dx_{t-1}}{\int \hat\omega(y_t, x_t)\, \hat\tau(x_t, x_{t-1})\, \hat p(x_{t-1}\,|\,y_{1:t-1})\, dx_t\, dx_{t-1}}. \qquad (20)$$
Note that when $\hat\tau$, $\hat\omega$, $\hat p(x_0)$ correspond to the true transition, observation and initial state probability, the formula above yields the exact distribution $p(x_t|y_{1:t})$ over the states at time $t$ (this follows directly from repeated applications of Bayes' rule; see also e.g. [5]). If $\hat\tau$, $\hat\omega$, $\hat p(x_0)$ are PSD models, then each $\hat p(x_t|y_{1:t})$ is a PSD model, recursively defined only in terms of the previous estimate and the operations introduced in Sec. 2.1. In particular, we have the following result.

**Proposition 10** (PSD Hidden Markov Models). Let $X_0 \in \mathbb{R}^{n_0\times d}$, $X_+, X_- \in \mathbb{R}^{n\times d}$, $X'_0 \in \mathbb{R}^{m\times d}$, $Y \in \mathbb{R}^{m\times d'}$, $A_0 \in \mathbb{S}^{n_0}_+$, $B \in \mathbb{S}^n_+$, $C \in \mathbb{S}^m_+$ and $\eta_0, \eta_-, \eta'_0, \eta_+ \in \mathbb{R}^d_{++}$, $\eta_{\mathrm{obs}} \in \mathbb{R}^{d'}_{++}$. Let
$$\hat\tau(x_+, x) = f(x_+, x;\, B, [X_+, X_-], (\eta_+, \eta_-)), \qquad \hat\omega(y, x) = f(y, x;\, C, [Y, X'_0], (\eta_{\mathrm{obs}}, \eta'_0)) \qquad (21)$$
be the approximate transition and observation functions. Then, given the initial state probability $\hat p(x_0) = f(x_0;\, A_0, X_0, \eta_0)$, for any $t \geq 1$ the estimate $\hat p$ in (20) is a PSD model of the form
$$\hat p(x_t\,|\,y_{1:t}) = f(x_t;\, A_t, \bar X, \eta'_0 + \eta_+), \qquad (22)$$
where $\bar X = \tfrac{\eta'_0}{\eta'_0+\eta_+}\, X'_0 \otimes 1_n + \tfrac{\eta_+}{\eta'_0+\eta_+}\, 1_m \otimes X_+$ and $A_t$ is recursively obtained from $A_{t-1}$ as in Alg. 1.

**Algorithm 1: PSD Hidden Markov Model**

*Input:* transition $\hat\tau(x_+, x) = f(x_+, x;\, B, [X_+, X_-], (\eta_+, \eta_-))$, initial $\hat p(x_0) = f(x_0;\, A_0, X_0, \eta_0)$ and observation $\hat\omega(y, x) = f(y, x;\, C, [Y, X'_0], (\eta_{\mathrm{obs}}, \eta'_0))$ distributions.

*Definitions:* $\bar X$ as in Prop. 10, $\tilde\eta = \tfrac{\eta_-\, (\eta'_0+\eta_+)}{\eta_- + \eta'_0 + \eta_+}$, $\bar X_0 = \tfrac{\eta_-}{\eta_-+\eta'_0+\eta_+}\, X_- \otimes 1_{nm} + \tfrac{\eta'_0+\eta_+}{\eta_-+\eta'_0+\eta_+}\, 1_n \otimes \bar X$, and $\bar\eta_0 = \eta_- + \eta'_0 + \eta_+$ (at $t = 1$, replace $\bar X$, $\eta'_0+\eta_+$ and $nm$ by $X_0$, $\eta_0$ and $n_0$).

*For any new observation $y_t$:*
- $C_t = C \circ \big(K_{Y,y_t,\eta_{\mathrm{obs}}}\, K_{Y,y_t,\eta_{\mathrm{obs}}}^\top\big)$ // partial evaluation: $\hat\omega_t(x_+) = \hat\omega(y{=}y_t, x_+)$
- $B_t = (B \otimes A_{t-1}) \circ \big(\mathrm{vec}(K_{X_-,\bar X,\tilde\eta})\, \mathrm{vec}(K_{X_-,\bar X,\tilde\eta})^\top\big)$ // product: $\hat\beta_t(x_+, x) = \hat\tau(x_+, x)\, \hat p(x\,|\,y_{1:t-1})$
- $D_t = (I_n \otimes 1_{nm}^\top)\, \big(B_t \circ K_{\bar X_0,\bar X_0,\bar\eta_0/2}\big)\, (I_n \otimes 1_{nm})$ // marginalization and reduction: $\hat\beta_t(x_+) = \int \hat\beta_t(x_+, x)\, dx$ (the constant of Prop. 1 is omitted, as it cancels in the normalization below)
- $E_t = (C_t \otimes D_t) \circ \big(\mathrm{vec}(K_{X_+,X'_0,\tilde\eta'})\, \mathrm{vec}(K_{X_+,X'_0,\tilde\eta'})^\top\big)$, with $\tilde\eta' = \tfrac{\eta'_0\, \eta_+}{\eta'_0+\eta_+}$ // product: $\hat\alpha_t(x_+) = \hat\omega_t(x_+)\, \hat\beta_t(x_+)$
- $A_t = E_t / c_t$, with $c_t = c_{2(\eta'_0+\eta_+)}\, \mathrm{Tr}\big(E_t\, K_{\bar X,\bar X,(\eta'_0+\eta_+)/2}\big)$ // normalization: $\hat p(x_+\,|\,y_{1:t}) = \hat\alpha_t(x_+) / \int \hat\alpha_t$

*Return* $\hat p(x_t\,|\,y_{1:t}) = f(x_t;\, A_t, \bar X, \eta'_0 + \eta_+)$.
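The following sketch (ours, not the paper's implementation) carries out one update of Algorithm 1 in the simplest one-dimensional setting, keeping the marginalization constant explicit, and checks the result against a direct numerical evaluation of (20) on a grid.

```python
import numpy as np

def gk(X, Z, eta):
    """Gaussian kernel matrix k_eta between the rows of X and Z."""
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum("nmd,d,nmd->nm", diff, eta, diff))

def c2(eta):
    """c_{2 eta} = (pi/2)^{d/2} det(diag(eta))^{-1/2}, as in eq. (4)."""
    return (np.pi / 2) ** (len(eta) / 2) / np.sqrt(np.prod(eta))

def filter_step(B, Xp, Xm, eta_p, eta_m,       # transition  tau(x+, x) = f(x+, x; B, [Xp, Xm], (eta_p, eta_m))
                C, Yo, Xo, eta_obs, eta_o,     # observation omega(y, x) = f(y, x; C, [Yo, Xo], (eta_obs, eta_o))
                A_prev, X_prev, eta_prev,      # previous estimate p(x | y_{1:t-1}) = f(x; A_prev, X_prev, eta_prev)
                y_t):
    """One update of Algorithm 1; returns (A_t, X_bar, eta_bar) with p(x_t | y_{1:t}) = f(.; A_t, X_bar, eta_bar)."""
    n, q, m = len(Xp), len(X_prev), len(Yo)
    # 1) Partial evaluation of the observation model at y = y_t (eq. 3).
    ky = gk(Yo, y_t[None, :], eta_obs)
    Ct = C * (ky @ ky.T)
    # 2) Product tau(x+, x) * p_prev(x) over the shared variable x (Prop. 2).
    s1 = eta_m + eta_prev
    v1 = gk(Xm, X_prev, eta_m * eta_prev / s1).reshape(-1, 1)
    Bt = np.kron(B, A_prev) * (v1 @ v1.T)
    Xbar0 = ((Xm[:, None, :] * eta_m + X_prev[None, :, :] * eta_prev) / s1).reshape(n * q, -1)
    # 3) Marginalize x (Prop. 1, constant kept explicit) and reduce the x+ points back to Xp (eq. 7).
    R = np.kron(np.eye(n), np.ones((q, 1)))                    # reduction matrix I_n (x) 1_q
    Dt = R.T @ (c2(s1) * Bt * gk(Xbar0, Xbar0, s1 / 2)) @ R
    # 4) Product with the evaluated observation model (Prop. 2 again), then normalize (eq. 4).
    s2 = eta_o + eta_p
    v2 = gk(Xo, Xp, eta_o * eta_p / s2).reshape(-1, 1)
    Et = np.kron(Ct, Dt) * (v2 @ v2.T)
    X_bar = ((Xo[:, None, :] * eta_o + Xp[None, :, :] * eta_p) / s2).reshape(m * n, -1)
    ct = c2(s2) * np.trace(Et @ gk(X_bar, X_bar, s2 / 2))
    return Et / ct, X_bar, s2

# A tiny 1-D check of one step against numerical integration of (20).
rng = np.random.default_rng(3)
def rand_psd(k):
    L = rng.normal(size=(k, k)) / k
    return L @ L.T

n, m, q = 6, 5, 4
Xp, Xm, Yo, Xo, X_prev = (rng.uniform(-1, 1, (k, 1)) for k in (n, n, m, m, q))
B, C, A_prev = rand_psd(n), rand_psd(m), rand_psd(q)
eta_p, eta_m, eta_obs, eta_o, eta_prev = (np.array([v]) for v in (3.0, 2.0, 4.0, 3.5, 2.5))
y_t = np.array([0.1])

A_t, X_bar, eta_bar = filter_step(B, Xp, Xm, eta_p, eta_m, C, Yo, Xo, eta_obs, eta_o,
                                  A_prev, X_prev, eta_prev, y_t)

# Numerical reference: evaluate (20) on a grid.
def f_eval(A, X, eta, pts):
    k = gk(X, pts, eta)
    return np.einsum("ig,ij,jg->g", k, A, k)

g = np.linspace(-4.0, 4.0, 801)
dx = g[1] - g[0]
G1, G2 = np.meshgrid(g, g, indexing="ij")                      # (x+, x) grid
tau = f_eval(B, np.hstack([Xp, Xm]), np.concatenate([eta_p, eta_m]),
             np.stack([G1.ravel(), G2.ravel()], axis=1)).reshape(len(g), len(g))
prev = f_eval(A_prev, X_prev, eta_prev, g[:, None])
obs = f_eval(C, np.hstack([Yo, Xo]), np.concatenate([eta_obs, eta_o]),
             np.stack([np.full_like(g, y_t[0]), g], axis=1))
post = obs * (tau @ prev) * dx
post /= post.sum() * dx
print(np.max(np.abs(post - f_eval(A_t, X_bar, eta_bar, g[:, None]))))  # ~ discretization error only
```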
**Remark 4** (Sum-product algorithm). Eq. (20) is an instance of the so-called sum-product algorithm, a standard inference method for graphical models [5] (of which HMMs are a special case). The application of the sum-product algorithm relies mainly on the sum and product rules for probabilities (as is the case for HMMs in (20)). Hence, according to Sec. 2.1, it is highly compatible with PSD models.

## 5 Discussion

In this work we have shown that PSD models are a strong candidate for practical applications related to probabilistic inference. They satisfy both requirements for an ideal probabilistic model: i) they perform exact sum and product rules in terms of efficient matrix operations; ii) we proved that they can concisely approximate a wide range of probabilities.

**Future Directions.** We identify three main directions for future work: i) when performing inference on large graphical models (see Remark 4), the multiplication of PSD models might lead to an inflation in the number of base points. Building on our compression strategy, we plan to further investigate low-rank approximations to mitigate this issue. ii) An interesting problem is to understand how to efficiently sample from a PSD model; a first answer to this open question was recently given in [15]. iii) The current paper has a purely theoretical and algorithmic focus. In the future, we plan to investigate the empirical behavior of PSD models on the applications introduced in Sec. 4. Related to this, we plan to develop a library for operations with PSD models and make it available to the community.

**Acknowledgments.** A.R. acknowledges the support of the French government under the management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and the support of the European Research Council (grant REAL 947908). C.C. acknowledges the support of the Royal Society (grant SPREM RGS\R1\201149) and Amazon.com Inc. (Amazon Research Award ARA 2020).

## References

[1] Robert A. Adams and John J. F. Fournier. Sobolev Spaces. Elsevier, 2003.

[2] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Niles-Weed. Massively scalable Sinkhorn distances via the Nyström method. arXiv preprint arXiv:1812.05189, 2018.

[3] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[4] G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data. MIT Press, 2007.

[5] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, Heidelberg, 2006.

[6] René Erlín Castillo and Humberto Rafeiro. An Introductory Course in Lebesgue Spaces. Springer, 2016.

[7] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A general framework for consistent structured prediction with implicit loss embeddings. Journal of Machine Learning Research, 21(98):1–67, 2020.

[8] Felipe Cucker and Ding Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint, volume 24. Cambridge University Press, 2007.

[9] Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2012.
[10] Alexander Goldenshluger and Oleg Lepski. On adaptive minimax density estimation on R^d. Probability Theory and Related Fields, 159(3-4):479–543, 2014.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[12] Lars Hörmander. The Analysis of Linear Partial Differential Operators II: Differential Operators with Constant Coefficients. Springer, 1990.

[13] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

[14] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. In Advances in Neural Information Processing Systems, 2020.

[15] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Sampling from arbitrary functions via PSD models. arXiv preprint arXiv:2110.10527, 2021.

[16] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 2017.

[17] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[18] Erich Novak. Deterministic and Stochastic Error Bounds in Numerical Analysis. Springer, 2006.

[19] Sebastian Nowozin, Christoph H. Lampert, et al. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 2011.

[20] Nicolò Pagliana, Alessandro Rudi, Ernesto De Vito, and Lorenzo Rosasco. Interpolation and learning with scale dependent kernels. arXiv preprint arXiv:2006.09984, 2020.

[21] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

[22] Christian Rieger and Barbara Zwicknagl. Sampling inequalities for infinitely smooth functions, with applications to interpolation and machine learning. Advances in Computational Mathematics, 32(1):103, 2010.

[23] Luke Gervase Rogers. A Degree-Independent Sobolev Extension Operator. Yale University, 2004.

[24] Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. arXiv preprint arXiv:2012.11978, 2020.

[25] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In NIPS, pages 3215–3225, 2017.

[26] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[27] John Shawe-Taylor and Nello Cristianini. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[28] Winfried Sickel. Composition operators acting on Sobolev spaces of fractional order — a survey on sufficient and necessary conditions. In Function Spaces, Differential Operators and Nonlinear Analysis (Paseky nad Jizerou, 1995), pages 159–182, 1996.

[29] Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabás Póczos. Nonparametric density estimation with adversarial losses. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 10246–10257. Curran Associates Inc., 2018.

[30] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
[31] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

[32] Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research, 18, 2017.

[33] Elias M. Stein and Guido Weiss. Introduction to Fourier Analysis on Euclidean Spaces (PMS-32), Volume 32. Princeton University Press, 1971.

[34] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer New York, 2008.

[35] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

[36] Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics, pages 2512–2520. PMLR, 2021.

[37] Adrien Vacher, Boris Muzellec, Alessandro Rudi, Francis Bach, and François-Xavier Vialard. A dimension-free computational upper-bound for smooth optimal transport estimation. arXiv preprint arXiv:2101.05380, 2021.

[38] Matt P. Wand and M. Chris Jones. Kernel Smoothing. CRC Press, 1994.

[39] Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

[40] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, pages 682–688, 2001.

[41] Vadim Yurinsky. Sums and Gaussian Vectors. Springer, 1995.