# Fisher Efficient Inference of Intractable Models

Song Liu (University of Bristol; The Alan Turing Institute, UK), song.liu@bristol.ac.uk
Takafumi Kanamori (Tokyo Institute of Technology; RIKEN, Japan), kanamori@c.titech.ac.jp
Wittawat Jitkrittum (Max Planck Institute for Intelligent Systems, Germany), wittawat@tuebingen.mpg.de
Yu Chen (University of Bristol, UK), yc14600@bristol.ac.uk

Maximum Likelihood Estimators (MLE) have many good properties. For example, the asymptotic variance of the MLE solution attains the asymptotic Cramér-Rao lower bound (efficiency bound), which is the minimum possible variance for an unbiased estimator. However, obtaining such an MLE solution requires calculating the likelihood function, which may not be tractable due to the normalization term of the density model. In this paper, we derive a Discriminative Likelihood Estimator (DLE) from the Kullback-Leibler divergence minimization criterion implemented via density ratio estimation and a Stein operator. We study the problem of model inference using DLE. We prove its consistency and show that the asymptotic variance of its solution can attain the equality of the efficiency bound under mild regularity conditions. We also propose a dual formulation of DLE which can be easily optimized. Numerical studies validate our asymptotic theorems and we give an example where DLE successfully estimates an intractable model constructed using a pre-trained deep neural network.

1 Introduction

Maximum Likelihood Estimation (MLE) has been a classic choice of density parameter estimator. It can be derived from the Kullback-Leibler (KL) divergence minimization criterion, and the resulting algorithm simply maximizes the likelihood function (log-density function) over a set of observations. The solution of MLE has many attractive asymptotic properties: the asymptotic variance of MLE solutions reaches an asymptotic lower bound over all unbiased estimators [5, 24]. However, learning via MLE requires evaluating the normalization term of the density function; it may be challenging to apply MLE to learn a complex model that has a computationally intractable normalization term.

A partial solution to this problem is approximating the normalization term or the gradient of the likelihood function numerically. Many methods along this line of research have been actively studied: importance-sampling MLE [25], contrastive divergence [12] and, more recently, amortized MLE [33]. While the computation of the normalization term is mitigated, these sampling-based approximate methods come at the expense of extra computational burden and estimation errors.

The issue of intractable normalization terms has led to the development of approaches other than KL divergence minimization. For example, Score Matching (SM) [13] minimizes the Fisher divergence [26] between the data distribution and a model distribution which is specified by the gradient (with respect to the input variable) of its log density function. Its computation does not require the evaluation of the normalization term, thus SM does not suffer from the intractability issue. Extensions of SM have been used for infinite-dimensional exponential family models [28], non-negative models [14, 35] and high-dimensional graphical model fitting [17].
Other than the Fisher divergence, a kernel-based divergence measure known as Kernel Stein Discrepancy (KSD) [4, 19] has been proposed as a test statistic for goodness-of-fit testing, measuring the difference between a data distribution and a model distribution without the hassle of evaluating the normalization term. It reformulates the kernel Maximum Mean Discrepancy (MMD) [9] with a Stein operator [29, 8, 23], which is also defined using the gradient of the log density function. For the same reason as in SM, the KSD can be estimated when applied to a density model with an intractable normalizer. The last few years have seen many applications of KSD, such as variational inference [18], sampling [23, 3], and score function estimation [16, 27], among others.

KSD minimization is a natural candidate criterion for fitting intractable models [2]. However, the divergence measure defined by the KSD is directly characterized by the kernel used. Unlike in the case of goodness-of-fit testing, where the kernel may be chosen by maximizing the test power [15], to date there is no clear objective for choosing the right kernel in the case of model fitting. By contrast, the KL divergence has been a classic discrepancy measure for model fitting. The question that we address is: can we construct a generic model inference method by minimizing the KL divergence without knowledge of the normalization term?

In this paper, we present a novel unnormalized model inference method, Discriminative Likelihood Estimation (DLE), by following the KL divergence minimization criterion. The algorithm uses a technique called Density Ratio Estimation [31], which is conventionally used to estimate the ratio between two density functions from two sets of samples. We adapt this method to estimate the ratio between a data density and an unnormalized density model with the help of a Stein operator. We then use the estimated ratio to construct a surrogate of the KL divergence, which is later minimized to fit the parameters of an unnormalized density function. The resulting algorithm is a min-max problem, which we show can be conveniently converted into a min-min problem using Lagrangian duality. No extra sampling steps are required. We further prove the consistency and asymptotic properties of DLE under mild conditions. One of our major contributions is that we prove the proposed estimator can also attain the asymptotic Cramér-Rao bound. Numerical experiments validate our theories and we show DLE indeed performs well under realistic settings.

2 Background

2.1 Problem: Intractable Model Inference via KL Divergence Minimization

Consider the problem of estimating the parameter $\theta$ of a probability density model $p(x;\theta)$ from a set of i.i.d. samples $X_q := \{x_q^{(i)}\}_{i=1}^{n_q} \overset{\mathrm{i.i.d.}}{\sim} Q$, where $Q$ is a probability distribution whose density function is $q(x)$. One idea is minimizing the sample-approximated KL divergence from $p_\theta$ to $q$:
$$\min_\theta \mathrm{KL}[q\,|\,p_\theta] = \min_\theta \mathbb{E}_q\!\left[\log\frac{q(x)}{p(x;\theta)}\right] = C - \max_\theta \mathbb{E}_q[\log p(x;\theta)] \approx C - \max_\theta \frac{1}{n_q}\sum_{i=1}^{n_q}\log p(x_q^{(i)};\theta),$$
where $C$ is a constant that does not depend on $\theta$. The last step uses $X_q$ to approximate the expectation over $q(x)$. This technique is known as Maximum Likelihood Estimation (MLE).
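As a concrete illustration of the sample-approximated objective above, the following minimal sketch (our own, not code from the paper) fits the mean of a univariate Gaussian by minimizing the negative average log-likelihood with scipy; the data-generating parameters are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: n_q i.i.d. draws from q(x) = N(1.5, 1).
rng = np.random.default_rng(0)
X_q = rng.normal(loc=1.5, scale=1.0, size=500)

def neg_avg_loglik(theta):
    # Sample approximation of KL[q | p_theta] up to the constant C:
    # -(1/n_q) * sum_i log p(x_q^(i); theta) with p = N(theta, 1).
    return 0.5 * np.mean((X_q - theta) ** 2) + 0.5 * np.log(2 * np.pi)

theta_mle = minimize_scalar(neg_avg_loglik).x
print(theta_mle)  # close to the sample mean of X_q
```

For this toy model the minimizer is simply the sample mean; the point of the sketch is only that the objective requires evaluating the normalized density $p(x;\theta)$, which is exactly what becomes unavailable in the intractable case discussed next.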
Despite many advantages, MLE is unfit for intractable model inference. Consider, for instance, a density model $p(x;\theta) := \frac{\bar p(x;\theta)}{Z(\theta)}$, where $\bar p(x;\theta)$ is a positive multilayer neural network parametrized by $\theta$ and $Z(\theta) = \int \bar p(x;\theta)\,\mathrm{d}x$ is the normalization term which guarantees that $p(x;\theta)$ integrates to 1 over its domain. In this example, $Z(\theta)$ does not have a computationally tractable form; therefore, MLE cannot be used without approximating the likelihood function or its gradient using numerical methods such as Markov chain Monte Carlo (MCMC).

However, there is an alternative approach to minimizing the KL divergence: $\mathrm{KL}[q\,|\,p_\theta]$ is the expectation of the log-ratio $\log\frac{q(x)}{p(x;\theta)}$ with respect to the data distribution $q(x)$. If we have access to $\frac{q(x)}{p(x;\theta)}$, we can approximate this KL divergence by averaging the log of the density ratio over the samples $X_q$, and the density model parameter $\theta$ can subsequently be estimated by minimizing this approximation to the KL divergence.

2.2 Two Sample Density Ratio Estimation

Traditionally, Density Ratio Estimation (DRE) [30, 31] refers to estimating the ratio of two unknown densities from their samples. Given two sets of i.i.d. samples drawn separately from distributions $Q$ and $P$,
$$X_q := \{x_q^{(i)}\}_{i=1}^{n_q} \sim Q,\qquad X_p := \{x_p^{(i)}\}_{i=1}^{n_p} \sim P,\qquad x_q, x_p \in \mathbb{R}^d,$$
where $Q$ and $P$ have density functions $q(x)$ and $p(x)$ respectively, we hope to estimate the ratio $\frac{q(x)}{p(x)}$. We can model the density ratio using a function $r(x;\delta)$ parameterized by $\delta$. To obtain the parameter $\delta$, we minimize the KL divergence $\mathrm{KL}[q|q_\delta]$, where $q(x;\delta) := r(x;\delta)p(x)$:
$$\min_\delta \mathrm{KL}[q|q_\delta]\quad \text{s.t.}\quad \int r(x;\delta)p(x)\,\mathrm{d}x = 1. \tag{1}$$
$\mathrm{KL}[q|q_\delta]$ comprises three terms, of which only one depends on the parameter $\delta$:
$$\mathrm{KL}[q|q_\delta] = \mathbb{E}_q[\log q(x)] - \mathbb{E}_q[\log r(x;\delta)] - \mathbb{E}_q[\log p(x)] \approx -\frac{1}{n_q}\sum_{i=1}^{n_q}\log r(x_q^{(i)};\delta) + C, \tag{2}$$
where the last step uses $X_q$ to approximate the expectation over $q(x)$, and $C$ is a constant irrelevant to $\delta$. We can also approximate the equality constraint in (1) using $X_p$:
$$\int r(x;\delta)p(x)\,\mathrm{d}x \approx \frac{1}{n_p}\sum_{j=1}^{n_p} r(x_p^{(j)};\delta). \tag{3}$$
Combining (2) and (3), we get a sample version of (1):
$$\hat\delta := \operatorname*{argmin}_\delta\ -\frac{1}{n_q}\sum_{i=1}^{n_q}\log r(x_q^{(i)};\delta) + C\quad \text{s.t.}\quad \frac{1}{n_p}\sum_{j=1}^{n_p} r(x_p^{(j)};\delta) = 1. \tag{4}$$
The above optimization is called the Kullback-Leibler Importance Estimation Procedure (KLIEP) [30]. Unfortunately, it cannot be directly used to estimate our ratio $\frac{q(x)}{p(x;\theta)}$, since we only have samples from $q(x)$ but not from $p(x;\theta)$. Consequently, the equality constraint $\int r(x;\delta)p(x;\theta)\,\mathrm{d}x = 1$ can no longer be approximated using samples. A natural remedy to this problem is to draw samples from $p(x;\theta)$ using sampling techniques such as MCMC, which, in general, can be costly when $p(x;\theta)$ is complex. Correlation among samples drawn from an MCMC scheme further complicates the estimation of the ratio. More importantly, regardless of the feasibility of sampling from $p(x;\theta)$, the availability of an explicit (possibly unnormalized) density $\bar p(x;\theta)$ is much more valuable than samples alone, especially in a high-dimensional space where samples rarely capture the fine-grained structural information present in the density model $p(x;\theta)$. In this work, we propose a new procedure, Stein Density Ratio Estimation, which can directly use the (unnormalized) density $\bar p$, as it is, without sampling from it. Moreover, the new procedure (described in Section 3.1) yields a density ratio model $r_\theta(x;\delta)$ for the ratio function $\frac{q(x)}{p(x;\theta)}$ that automatically satisfies the aforementioned equality constraint for all $\theta$.
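Before moving on, here is a minimal sketch (ours, not the paper's code) of the two-sample KLIEP problem (4) with a simple log-linear ratio model $r(x;\delta) = \exp(\delta^\top[x, x^2])$; the positive-by-construction exponential form, the feature map, and the data are our hypothetical choices. The empirical normalization constraint (3) is enforced with scipy's SLSQP.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Hypothetical samples: q = N(0.5, 1), p = N(0, 1).
X_q = rng.normal(0.5, 1.0, size=(500, 1))
X_p = rng.normal(0.0, 1.0, size=(500, 1))

def feats(X):
    # A hypothetical feature map f(x) = [x, x^2].
    return np.hstack([X, X ** 2])

F_q, F_p = feats(X_q), feats(X_p)
ratio = lambda F, d: np.exp(F @ d)        # r(x; delta) = exp(delta^T f(x)) > 0

# KLIEP, eq. (4): maximize mean log r over X_q subject to mean r over X_p = 1.
obj = lambda d: -np.mean(np.log(ratio(F_q, d)))
con = {"type": "eq", "fun": lambda d: np.mean(ratio(F_p, d)) - 1.0}

delta_hat = minimize(obj, x0=np.zeros(2), method="SLSQP", constraints=[con]).x
print(delta_hat)                           # estimated ratio parameters
print(np.mean(ratio(F_p, delta_hat)))      # ~1, the normalization constraint
```

The key point, as noted above, is that this construction needs samples from $p$; the Stein approach developed next removes that requirement.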
3 Stein Density Ratio Estimation

Let us consider a linear-in-parameter density ratio model $r(x;\delta) := \delta^\top f(x)$, where $f(x)$ is a feature function that transforms a data point $x$ into a more powerful representation. To better model $\frac{q(x)}{p(x;\theta)}$, we define a family of feature functions called Stein features.

3.1 Stein Features

Suppose we have a feature function $f(x): \mathbb{R}^d \to \mathbb{R}^b$ and a density model $p(x;\theta): \mathbb{R}^d \to \mathbb{R}$. A Stein feature $T_\theta f(x) \in \mathbb{R}^b$ with respect to $p(x;\theta)$ is
$$T_\theta f(x) := [T_\theta f_1(x), \ldots, T_\theta f_i(x), \ldots, T_\theta f_b(x)]^\top,$$
where $T_\theta$ is a Stein operator [29, 8, 4, 23] and $T_\theta f_i(x) \in \mathbb{R}$ is defined as
$$T_\theta f_i(x) := \langle \nabla_x \log p(x;\theta),\ \nabla_x f_i(x)\rangle + \mathrm{trace}\!\left(\nabla_x^2 f_i(x)\right),$$
where $f_i$ is the $i$-th output of the function $f$, $\nabla_x f_i(x)$ is the gradient of $f_i(x)$ and $\nabla_x^2 f_i(x)$ is the Hessian of $f_i(x)$. Note that computing $T_\theta f(x)$ does not require evaluating the normalization term $Z(\theta)$, as $\nabla_x \log p(x;\theta) = \nabla_x \log \bar p(x;\theta) - \nabla_x \log Z(\theta)$, where $\nabla_x \log Z(\theta) = 0$.

Example 1. Let $p(x;\theta)$ be in the exponential family with sufficient statistic $\psi(x)$. Then $T_\theta f_i(x) = \theta^\top J_x\psi(x)\,\nabla_x f_i(x) + \mathrm{trace}[\nabla_x^2 f_i(x)]$, where $J_x\psi(x) \in \mathbb{R}^{\dim(\theta)\times d}$ is the Jacobian of $\psi(x)$ and $\dim(\theta)$ is the dimension of $\theta$. One more example can be found in Appendix, Section A.1.

A slightly different Stein operator was introduced in [4, 23], where $T'_\theta f(x) \in \mathbb{R}$ for $f(x) \in \mathbb{R}^d$ is defined as $T'_\theta f(x) := \sum_{i=1}^d [\partial_{x_i}\log p(x;\theta)]\,f_i(x) + \partial_{x_i} f_i(x)$, where $\partial_{x_i} f(x)$ is the partial derivative of $f(x)$ with respect to $x_i$. The relationship between $T$ and $T'$ is $T_\theta f_i(x) = T'_\theta \nabla_x f_i(x)$.

Next we show an important property of Stein features.

Proposition 1 (Stein's Identity). Suppose $p(x;\theta) > 0$; for all $i, j$, $\lim_{|x_j|\to\infty} p(x_1,\ldots,x_j,\ldots,x_d;\theta)\,\partial_{x_j} f_i(x_1,\ldots,x_j,\ldots,x_d) = 0$; $p(x;\theta)$ is continuously differentiable; and $f_i$ is twice continuously differentiable for all $\theta$ and $i$. Then $\mathbb{E}_{p_\theta}[T_\theta f(x)] = 0$ for all $\theta$.

We give a proof in Appendix, Section B.1. Similar identities were given in previous literature, such as Lemma 2.2 in [19] or Lemma 5.1 in [4]. Utilizing this property, we can construct a density ratio model which bypasses the intractable equality constraint when estimating $\frac{q(x)}{p(x;\theta)}$, as shown in the next section.

3.2 Stein Density Ratio Modeling and Estimation (SDRE)

Define a linear-in-parameter density ratio model $r_\theta(x;\delta) := \delta^\top T_\theta f(x) + 1$ using a Stein feature function. We can see that
$$\mathbb{E}_{p_\theta}[r_\theta(x;\delta)] = \mathbb{E}_{p_\theta}[\delta^\top T_\theta f(x) + 1] = \delta^\top \mathbb{E}_{p_\theta}[T_\theta f(x)] + 1 = 1,$$
where the last equality is ensured by Proposition 1 for all $\delta$ and $\theta$ if the specified regularity conditions are met. This equality means the constraint in (1) is automatically satisfied by this density ratio model. Now we can solve (4) without its equality constraint:
$$\hat\delta := \operatorname*{argmin}_{\delta\in\mathbb{R}^b}\ -\frac{1}{n_q}\sum_{i=1}^{n_q}\log r_\theta(x_q^{(i)};\delta) + C = \operatorname*{argmax}_{\delta\in\mathbb{R}^b}\ \frac{1}{n_q}\sum_{i=1}^{n_q}\log\!\left[\delta^\top T_\theta f(x_q^{(i)}) + 1\right]. \tag{5}$$
It can be seen that (5) is an unconstrained concave maximization problem. Note that for all $x_q \in X_q$, $r_\theta(x_q;\hat\delta)$ must be strictly positive thanks to the log-barrier (see e.g., Section 17.2 in [22]) in our objective function. However, it is not possible to guarantee that $r_\theta(x;\hat\delta)$ is positive for all $x \in \mathbb{R}^d$. This is not a problem in this paper, as the density ratio function is only used for approximating the KL divergence, and we will not evaluate $r_\theta(x;\hat\delta)$ at a data point $x$ outside of $X_q$. Note that the unnormalized density model $\bar p(x;\theta)$, by definition, should be non-negative everywhere for all $\theta$. We refer to the objective (5) as Stein Density Ratio Estimation (SDRE). One may notice that $\frac{1}{n_q}\sum_{i=1}^{n_q}\log r_\theta(x_q^{(i)};\hat\delta)$ is exactly the sample average of the estimated log-ratio over $X_q$, which allows us to approximate the KL divergence from $p(x;\theta)$ to $q(x)$.
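The following sketch (our own illustration, under simplifying assumptions) builds the Stein feature for a one-dimensional unnormalized Gaussian-shaped model $\bar p(x;\theta) = \exp(-(x-\theta)^2/2)$ with $f(x) = [x, x^2]^\top$, and maximizes the unconstrained concave objective (5) with scipy. Gradients of $\log\bar p$ and of $f$ are written analytically, so no autodiff library is assumed; the data and the candidate $\theta$ are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_q = rng.normal(1.0, 1.0, size=400)       # hypothetical data, q = N(1, 1)

def stein_features(x, theta):
    """T_theta f(x) for f(x) = [x, x^2] and unnormalized model
    log p_bar(x; theta) = -(x - theta)^2 / 2, so d/dx log p_bar = -(x - theta)."""
    dlogp = -(x - theta)
    t1 = dlogp * 1.0 + 0.0                 # <dlogp, f1'> + f1'' with f1(x) = x
    t2 = dlogp * 2.0 * x + 2.0             # <dlogp, f2'> + f2'' with f2(x) = x^2
    return np.stack([t1, t2], axis=1)      # shape (n_q, 2)

def sdre_objective(delta, theta):
    # Negative of objective (5); r_theta(x; delta) = delta^T T_theta f(x) + 1.
    r = stein_features(X_q, theta) @ delta + 1.0
    if np.any(r <= 0):                     # outside the log barrier
        return np.inf
    return -np.mean(np.log(r))

theta0 = 0.0                               # a (wrong) candidate model parameter
delta_hat = minimize(lambda d: sdre_objective(d, theta0), x0=np.zeros(2),
                     method="Nelder-Mead").x
print(delta_hat)                           # nonzero: the ratio detects q != p_theta0
```

By Proposition 1, $\mathbb{E}_{p_\theta}[r_\theta(x;\delta)] = 1$ automatically, which is why the equality constraint of (4) disappears in this sketch.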
4 Intractable Model Inference via Discriminative Likelihood Estimation

Let $\ell(\hat\delta,\theta) := \frac{1}{n_q}\sum_{i=1}^{n_q}\log r_\theta(x_q^{(i)};\hat\delta)$. We will use $\ell(\hat\delta,\theta)$ as a replacement for $\mathrm{KL}[q(x)|p(x;\theta)]$. The rationale of minimizing the KL divergence from $p(x;\theta)$ to $q(x)$ leads to
$$\min_\theta \ell(\hat\delta,\theta),\quad \text{or equivalently}\quad \min_\theta\max_\delta \ell(\delta,\theta). \tag{6}$$
The equivalence is due to the fact that $\ell$ evaluated at the optimal ratio parameter $\hat\delta$ is also the maximum of the DRE objective function when optimized with respect to $\delta$. The outer problem minimizes $\ell$ with respect to the density parameter $\theta$. We call this estimator Discriminative Likelihood Estimation (DLE), as the parameter of the density model $p(x;\theta)$ is learned via minimizing a discriminator (the word "discriminator" is borrowed from GAN [7]; indeed, DLE and GAN bear many resemblances), namely the likelihood ratio function $\ell(\hat\delta,\theta)$, which measures the differences between $q(x)$ and $p(x;\theta)$.

4.1 Consistency with Correct Model

For brevity, we state all theorems assuming all regularity conditions in Proposition 1 are met. Notations: $H$ is $\nabla^2_{(\delta,\theta)}\ell(\delta,\theta)$, the full Hessian of $\ell(\delta,\theta)$; $H_{\delta,\theta}$ is $\nabla_\delta\nabla_\theta\ell(\delta,\theta)$, the submatrix of the Hessian whose rows and columns are indexed by $\delta$ and $\theta$ respectively; $s \in \mathbb{R}^{\dim(\theta)}$ is $\nabla_\theta\log p(x;\theta)$ evaluated at $\theta^*$, the score function of $p(x;\theta)$; $\lambda(\cdot)$ is the eigenvalue operator; $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ are the minimum and maximum eigenvalues; and $\|\cdot\|$ is the operator norm.

We study the consistency of the following estimator under a correct model setting:
$$(\hat\delta,\hat\theta) := \arg\min_{\theta\in\Theta}\max_{\delta\in\Delta_{n_q}} \ell(\delta,\theta), \tag{7}$$
where $\Theta$ and $\Delta_{n_q}$ are compact parameter spaces for $\theta$ and $\delta$ respectively. The compactness condition is among a set of conditions commonly used in classic consistency proofs (see e.g., Wald's consistency proof, Section 5.2.1 in [32]). It is possible to derive weaker conditions given specific choices of $f$ or $p(x;\theta)$; however, in the current manuscript, we only focus on generic settings and conditions that give rise to estimation consistency and useful asymptotic theories. We assume $\Theta$ and $\Delta_{n_q}$ are properly chosen so that $(\hat\theta,\hat\delta)$ is a saddle point of (7).

First, we assume our density model $p(x;\theta)$ is correctly specified:

Assumption 1. There exists a unique pair of parameters $(\theta^*, \delta^*)$, $\theta^*\in\Theta$, $\delta^*\in\Delta_{n_q}$, such that $p(x;\theta^*) = q(x)$ and $r_{\theta^*}(x;\delta^*) = 1$.

Given how $r_\theta(x;\delta)$ is constructed in Section 3.2, the above assumption implies $\delta^*$ must be $0$.

Assumption 2. There exist constants $\Lambda_{\min} > 0$, $\Lambda'_{\min} > 0$ and $\Lambda_{\max} > 0$ so that for all $\theta\in\Theta$, $\delta\in\Delta_{n_q}$:
$$\lambda_{\min}\{-H_{\delta,\delta}\} \ge \Lambda'_{\min},\qquad \Lambda_{\max} \ge \left\|H_{\theta,\delta}H_{\delta,\delta}^{-1}\right\|,\qquad \lambda_{\min}\!\left\{H_{\theta,\delta}(-H_{\delta,\delta})^{-1}H_{\delta,\theta}\right\} \ge \Lambda_{\min} > 2\|H_{\theta,\theta}\|.$$

The lower-boundedness of $\lambda_{\min}\{-H_{\delta,\delta}\}$ implies the strict concavity of $\ell(\delta,\theta)$ with respect to $\delta$ ($\ell(\delta,\theta)$ is already concave by construction, see (5)): for all $\theta\in\Theta$, there exists a unique $\hat\delta(\theta)$ that maximizes the likelihood ratio objective, which means the likelihood ratio function should always have sufficient discriminative power to precisely pinpoint the differences between our data and the current model $\theta$. It also ensures that $\delta$ can "teach" the model parameter $\theta$ well, by assuming the interaction between $\delta$ and $\theta$ in our estimator, $H_{\theta,\delta}$, is well behaved. Now we analyze Assumption 2 in a special case:

Proposition 2. Let $\Delta_{n_q} := \{\delta \mid 1/C_{\mathrm{ratio}} \le r_\theta(x;\delta) \le C_{\mathrm{ratio}},\ \|\delta\|_2 \le T/\sigma(n_q),\ \forall\theta\in\Theta,\ \forall x\in X_q\}$, where $T > 0$ and $C_{\mathrm{ratio}} > 1$ are constants and $\sigma(\cdot)$ is a monotone-increasing function. Suppose $p(x;\theta)$ is in the exponential family with sufficient statistic $\psi(x)$ and the Stein feature is chosen as $T_\theta\psi(x)$. Suppose there exist constants $C_2, C_3, C_4, C_5, \Lambda'_{\max}, \Lambda'_{\min} > 0$ such that, with high probability, for all $\theta\in\Theta$,
$$C_2 \ge \frac{1}{n_q}\sum_{i=1}^{n_q}\left\|J_x\psi(x_q^{(i)})\right\|^4,\quad C_3 \le \lambda_{\min}\!\left[\frac{1}{n_q}\sum_{i=1}^{n_q}J_x\psi(x_q^{(i)})J_x\psi(x_q^{(i)})^\top\right],\quad \frac{1}{n_q}\sum_{i=1}^{n_q}\left\|J_x\psi(x_q^{(i)})J_x\psi(x_q^{(i)})^\top\right\| \le C_4,$$
$$\frac{1}{n_q}\sum_{i=1}^{n_q}\left\|J_x\psi(x_q^{(i)})J_x\psi(x_q^{(i)})^\top T_\theta\psi(x_q^{(i)})\right\| \le C_5,\quad\text{and}\quad \Lambda'_{\max} \ge \lambda\!\left[\frac{1}{n_q}\sum_{i=1}^{n_q}T_\theta\psi(x_q^{(i)})T_\theta\psi(x_q^{(i)})^\top\right] \ge \Lambda'_{\min}.$$
Then there exists a constant $N > 1$ such that when $n_q \ge N$, Assumption 2 holds with high probability.

The proof can be found in Appendix, Section B.3.
Note that in practice the domain constraint $\Delta_{n_q}$ in this proposition can be easily enforced via convex constraints or penalty terms. Analysis of a few other examples can be found in Appendix, Section A.2. Proposition 2 gives us some hints on how the feature function $f$ of the Stein feature can be chosen: in the case of the exponential family, the choice $f = \psi$ guarantees Assumption 2 to hold with high probability as $n_q$ increases.

Assumption 3 (Concentration of Stein features). The difference between the sample average of the Stein feature $T_{\theta^*}f(x)$ and its expectation over $q$ converges to zero in $\ell_2$ norm in probability:
$$\left\|\frac{1}{n_q}\sum_{i=1}^{n_q}T_{\theta^*}f(x_q^{(i)}) - \mathbb{E}_q[T_{\theta^*}f(x)]\right\|_2 \xrightarrow{P} 0.$$

Note that if Assumption 1 holds at the same time, Proposition 1 indicates $\mathbb{E}_q[T_{\theta^*}f(x)] = 0$. This assumption holds due to the (strong) law of large numbers, given that $\mathbb{E}_q[T_{\theta^*}f(x)]$ exists.

Theorem 1 (Consistency). Suppose Assumptions 1, 2 and 3 hold. Then $(\hat\delta,\hat\theta)\xrightarrow{P}(0,\theta^*)$.

See Section B.4 in Appendix for the proof. This theorem states that as $n_q$ increases, all saddle points of (7) converge to the vicinity of the true parameters. All the following theorems rely on the result of Theorem 1.

4.2 Asymptotic Variance of $\hat\theta$ and Fisher Efficiency of DLE

In this section we state one of our main contributions: DLE can attain the efficiency bound, i.e., the asymptotic Cramér-Rao bound, when $f(x)$ is appropriately chosen. First, we show that our estimator $\hat\theta$ has a simple asymptotic distribution which allows us to perform model inference. To state the theorem, we need an extra assumption on the Hessian $H$:

Assumption 4 (Uniform Convergence on $H$). $\sup_{\delta\in\Delta_{n_q},\theta\in\Theta}|H_{i,j} - \mathbb{E}_q[H_{i,j}]| \xrightarrow{P} 0$ for all $i, j$.

This assumption states that the second-order derivatives (which are averages over the samples in $X_q$) converge uniformly to their population means as $n_q\to\infty$. It helps us control the residual in the second-order Taylor expansion in our proof. This assumption may be weakened given specific choices of $f$ and $p(x;\theta)$, but we focus on establishing the asymptotic results in generic settings, so this condition is only listed as an assumption.

Theorem 2 (Asymptotic Normality of $\hat\theta$). Suppose Assumptions 1, 2, 3 and 4 hold. Then
$$\sqrt{n_q}\,(\theta^* - \hat\theta) \rightsquigarrow \mathcal{N}[0, V],\qquad V = \left(\mathbb{E}_q[H^*_{\theta,\delta}]\left(-\mathbb{E}_q[H^*_{\delta,\delta}]\right)^{-1}\mathbb{E}_q[H^*_{\delta,\theta}]\right)^{-1}, \tag{8}$$
where $H^*$ is $H$ evaluated at $(\delta^*, \theta^*)$.

See Section B.5 in Appendix for the proof. In practice, we do not know $\mathbb{E}_q[H^*]$, so we may use $\hat H$, the Hessian of $\ell(\delta,\theta)$ evaluated at $(\hat\delta,\hat\theta)$, as an approximation to $\mathbb{E}_q[H^*]$. Although MLE is also asymptotically normal, important quantities such as the Fisher information matrix may not be efficiently computed for intractable models. In comparison, Theorem 2 enables us to compute parameter confidence intervals for DLE even on an intractable $p_\theta$.
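A minimal sketch of how Theorem 2 can be used in practice (our own illustration, not the paper's code): approximate $\mathbb{E}_q[H^*]$ by the Hessian $\hat H$ of $\ell$ at $(\hat\delta,\hat\theta)$ (here via finite differences of a user-supplied $\ell$), form the plug-in variance, and report Wald-style confidence intervals. The helper names, the finite-difference scheme, and the assumption that `delta_hat` and `theta_hat` are 1-D numpy arrays are ours.

```python
import numpy as np
from scipy.stats import norm

def hessian_fd(fun, z, eps=1e-4):
    """Central finite-difference Hessian of a scalar function fun at point z."""
    k = len(z)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            zpp = z.copy(); zpp[i] += eps; zpp[j] += eps
            zpm = z.copy(); zpm[i] += eps; zpm[j] -= eps
            zmp = z.copy(); zmp[i] -= eps; zmp[j] += eps
            zmm = z.copy(); zmm[i] -= eps; zmm[j] -= eps
            H[i, j] = (fun(zpp) - fun(zpm) - fun(zmp) + fun(zmm)) / (4.0 * eps ** 2)
    return H

def dle_confidence_intervals(ell, delta_hat, theta_hat, n_q, level=0.95):
    """Wald-style confidence intervals for theta based on Theorem 2, with E_q[H*]
    approximated by the Hessian of ell at (delta_hat, theta_hat).
    `ell` maps the stacked vector [delta, theta] to the objective value."""
    b = len(delta_hat)
    H = hessian_fd(ell, np.concatenate([delta_hat, theta_hat]))
    H_dd, H_dt = H[:b, :b], H[:b, b:]      # H_{delta,delta} and H_{delta,theta}
    # V = (H_{theta,delta} (-H_{delta,delta})^{-1} H_{delta,theta})^{-1}, cf. (8).
    V = np.linalg.inv(-H_dt.T @ np.linalg.solve(H_dd, H_dt))
    se = np.sqrt(np.diag(V) / n_q)         # sqrt(n_q)(theta* - theta_hat) ~ N(0, V)
    z = norm.ppf(0.5 + level / 2.0)
    return np.column_stack([theta_hat - z * se, theta_hat + z * se])
```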
Now we consider the asymptotic efficiency of DLE with respect to specific choices of Stein features. Let $V_f$ be the asymptotic variance (8) using the Stein feature with a specific choice of $f$.

Lemma 3. Suppose that Assumptions 1, 2, 3 and 4 hold and $\mathbb{E}_q[T_{\theta^*}f(x)T_{\theta^*}f(x)^\top]$ is invertible. Moreover, suppose that the integration and the derivative in $\partial_{\theta_i}\int p(x;\theta)T_\theta f(x)\,\mathrm{d}x$ are exchangeable for all $i$. Then
$$V_f = \left(\mathbb{E}_q[s\,T_{\theta^*}f(x)^\top]\,\mathbb{E}_q[T_{\theta^*}f(x)T_{\theta^*}f(x)^\top]^{-1}\,\mathbb{E}_q[T_{\theta^*}f(x)\,s^\top]\right)^{-1}.$$

The proof is given in Section B.6 in the Appendix. Lemma 3 expresses the asymptotic variance using the score function and the Stein feature, and is used to prove that the variance monotonically decreases as the vector space spanned by the Stein feature vectors becomes larger.

Corollary 4 (Monotonicity of Asymptotic Variance). Define the inner product $\langle f, g\rangle := \mathbb{E}_q[fg]$ for functions $f$ and $g$. Let $T_{\theta^*}f(x) = [t_1,\ldots,t_b]^\top$ and $T_{\theta^*}\tilde f(x) = [\tilde t_1,\ldots,\tilde t_{\tilde b}]^\top$ be two Stein feature vectors. Assume that $\mathrm{span}\{t_1,\ldots,t_b\}\subseteq\mathrm{span}\{\tilde t_1,\ldots,\tilde t_{\tilde b}\}$, where $\mathrm{span}\{\cdot\}$ denotes the linear space spanned by the specified elements. Then the inequality $V_{\tilde f}\preceq V_f$ holds in the sense of positive semi-definiteness.

Proof. Let us define $P_f s$ as the orthogonal projection of $s$ onto $\mathrm{span}\{t_1,\ldots,t_b\}$. A simple calculation yields $P_f s = \mathbb{E}_q[s\,T_{\theta^*}f(x)^\top]\,\mathbb{E}_q[T_{\theta^*}f(x)T_{\theta^*}f(x)^\top]^{-1}\,T_{\theta^*}f(x)$, and thus Lemma 3 leads to $V_f^{-1} = \mathbb{E}_q[P_f s\,(P_f s)^\top]$. From the property of the orthogonal projection (see e.g., Theorem 2.23 in [34]), we have $\mathbb{E}_q[P_{\tilde f}s\,(P_{\tilde f}s)^\top] \succeq \mathbb{E}_q[P_f s\,(P_f s)^\top]$. Therefore, we obtain $V_{\tilde f}\preceq V_f$.

For $Q_f s = s - P_f s$, we have
$$\mathbb{E}_q[ss^\top] = \mathbb{E}_q[P_f s\,(P_f s)^\top] + \mathbb{E}_q[Q_f s\,(Q_f s)^\top] = V_f^{-1} + \mathbb{E}_q[Q_f s\,(Q_f s)^\top].$$
Thus, we see that the asymptotic variance converges to the inverse of the Fisher information, $\mathbb{E}_q[ss^\top]^{-1}$, as $P_f s$ gets close to $s$. In particular, when the linear space $\mathrm{span}\{t_1,\ldots,t_b\}$ includes $s$, $Q_f s$ vanishes and consequently the DLE with $f(x)$ is asymptotically efficient.

Example 2. Let $p(x;\theta)$ be the model of the $d$-dimensional multivariate Gaussian distribution $\mathcal{N}(\theta, I_d\sigma^2)$, where $I_d$ is the identity matrix. Here the variance $\sigma^2$ is assumed to be known. The score function is $s_j(x;\theta) = (x_j-\theta_j)/\sigma^2$, and the Stein feature vector defined from $f(x) = x$ is $(T_\theta x)_j = -(x_j-\theta_j)/\sigma^2$ for $j = 1,\ldots,d$. Clearly, the score function is included in $\mathrm{span}\{t_1,\ldots,t_d\}$. Hence, the DLE with $f$ achieves the efficiency bound of the parameter estimation. One more example can be found in Appendix, Section A.3.

In fact, Corollary 4 suggests that as long as we can represent the score function $s$ using the Stein feature $T_\theta f$ up to a linear transformation, DLE can achieve the efficiency bound. However, since $f$ is coupled with $\nabla_x\log p(x;\theta)$ in $T_\theta f$, it is not always easy to reverse-engineer an $f$ from $s$. Nonetheless, our numerical experiments show that using simple functions such as polynomials as $f$ yields good performance.

4.3 Model Selection of DLE

As our objective (6) tries to minimize the discrepancy between our model $p(x;\theta)$ and the data distribution, it is tempting to compare models using the objective function evaluated at $(\hat\delta,\hat\theta)$, i.e., $\ell(\hat\delta,\hat\theta)$. However, the more sophisticated $p(x;\theta)$ becomes, the more likely it picks up spurious patterns of our dataset. Similarly, the more powerful the Stein features are, the more likely the discriminator is overly critical of the density model on this dataset. Thus a better model selection criterion would be comparing $\mathbb{E}_q[\ell(\hat\delta,\hat\theta)]$, which eliminates the potential of overfitting a specific dataset. Unfortunately, this expectation cannot be computed without knowledge of $q(x)$. We propose to approximate this quantity using a penalized likelihood:

Theorem 5. Suppose Assumptions 1, 2, 3 and 4 hold, $\mathbb{E}_q[H^*_{\delta,\delta}]$ and $\mathbb{E}_q[H^*_{\delta,\theta}]$ are full-rank, and $\dim(\theta)\le b$. Then
$$n_q\,\mathbb{E}_q\!\left[\ell(\hat\delta,\hat\theta)\right] = \min_\theta\max_\delta\, n_q\,\ell(\delta,\theta) - b + \dim(\theta) + o_p(1).$$

See Section B.7 in Appendix for the proof. This theorem is closely related to a classic result, the Akaike Information Criterion (AIC) [1]. Both AIC and Theorem 5 penalize the degrees of freedom of the density model, $\dim(\theta)$, while our theorem also penalizes the number of ratio parameters, $\dim(\delta) = b$, due to the fact that our ratio function is also fitted using samples. One can also show that $\ell(\hat\delta,\hat\theta)$ follows a $\chi^2$ distribution; see Section B.8 in Appendix for details.
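As a small sketch of how the penalized likelihood of Theorem 5 could be turned into an AIC-style selection score, the function below simply applies the formula to fitted objective values; the function name, the numeric values, and the restriction to comparing two density models under one fixed Stein feature set are our own hypothetical choices (the joint selection over feature sets is described next).

```python
def dle_selection_score(ell_hat, n_q, b, dim_theta):
    """Estimate n_q * E_q[ell(delta_hat, theta_hat)] via Theorem 5.

    ell_hat   : value of the fitted min-max objective ell(delta_hat, theta_hat)
    n_q       : sample size
    b         : number of ratio parameters (Stein feature dimension)
    dim_theta : number of density-model parameters
    """
    return n_q * ell_hat - b + dim_theta

# Hypothetical comparison of two candidate density models sharing one Stein
# feature set: prefer the candidate with the smaller estimated expected loss.
score_A = dle_selection_score(ell_hat=0.031, n_q=500, b=6, dim_theta=2)
score_B = dle_selection_score(ell_hat=0.028, n_q=500, b=6, dim_theta=5)
best = min([("A", score_A), ("B", score_B)], key=lambda t: t[1])[0]
```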
Theorem 5 provides an information-criterion-based model selection method. Suppose $M$ is a set of different Stein features and $M'$ is a set of candidate density models. We can jointly select the density model and the Stein feature:
$$(\hat m, \hat m') := \arg\min_{m'\in M'}\max_{m\in M}\ \mathbb{E}_q\!\left[\ell(\hat\delta(m), \hat\theta(m'))\right],$$
where $(\hat\delta(m), \hat\theta(m'))$ are the estimated parameters under the model choice $(m, m')$. Replacing $\mathbb{E}_q[\ell(\hat\delta(m),\hat\theta(m'))]$ with the penalized likelihood derived in Theorem 5, we obtain a practical model selection method.

5 Lagrangian Dual of SDRE and DLE by Minimization

Some techniques can be used to directly optimize the min-max problem in (6), such as performing gradient descent/ascent on $\theta$ and $\delta$ alternately. However, looking for the saddle points of a min-max optimization is hard. In this section, we derive a partial Lagrangian dual for (6) so that we can convert this min-max problem into a min-min problem whose local optima can be efficiently found by existing optimization techniques.

Proposition 3. The SDRE problem in (5) has the Lagrangian dual
$$\hat\mu = \operatorname*{argmin}_{\mu}\ \sum_{i=1}^{n_q}\left[-(\log\mu_i) - 1\right] + \sum_{i=1}^{n_q}\mu_i \quad\text{s.t.}\quad \sum_{i=1}^{n_q}\mu_i\,T_\theta f(x_q^{(i)}) = 0. \tag{9}$$
Moreover, the duality gap between (9) and (5) is 0 and $r_\theta(x_q^{(i)};\hat\delta) = 1/\hat\mu_i$.

See Section B.9 in the Appendix for its proof. Instead of solving the min-max problem (6), we solve the following constrained minimization problem:
$$\min_\theta\min_\mu\ \sum_{i=1}^{n_q}\left[-(\log\mu_i) - 1\right] + \sum_{i=1}^{n_q}\mu_i \quad\text{s.t.}\quad \sum_{i=1}^{n_q}\mu_i\,T_\theta f(x_q^{(i)}) = 0, \tag{10}$$
where we replace the inner max problem in (6) with its Lagrangian dual (9). All experiments in this paper are performed using the Lagrangian dual objective (10). See https://github.com/lamfeeling/Stein-Density-Ratio-Estimation for code demos on SDRE and model inference.
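To make the min-min formulation concrete, here is a minimal sketch (ours, independent of the repository linked above) that solves (10) for the one-dimensional Gaussian-shaped example used earlier: for a fixed $\theta$ the inner problem over $\mu$ is solved with SLSQP under the linear constraint $\sum_i \mu_i T_\theta f(x_q^{(i)}) = 0$, and the outer problem over $\theta$ is handled by a scalar search. The model, feature map, data, and helper names are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(0)
X_q = rng.normal(1.0, 1.0, size=100)        # hypothetical data, q = N(1, 1)

def stein_features(x, theta):
    # Same Stein feature as before: f(x) = [x, x^2],
    # log p_bar(x; theta) = -(x - theta)^2 / 2.
    dlogp = -(x - theta)
    return np.stack([dlogp, 2.0 * x * dlogp + 2.0], axis=1)

def inner_dual(theta):
    """Inner problem of (10): min_mu sum_i [-(log mu_i) - 1] + sum_i mu_i
    subject to sum_i mu_i * T_theta f(x_i) = 0, with mu > 0."""
    T = stein_features(X_q, theta)
    n = len(X_q)
    obj = lambda mu: np.sum(-np.log(mu) - 1.0) + np.sum(mu)
    con = {"type": "eq", "fun": lambda mu: T.T @ mu}
    res = minimize(obj, x0=np.ones(n), method="SLSQP",
                   bounds=[(1e-8, None)] * n, constraints=[con])
    return res.fun

# Outer problem: minimize the inner dual value over theta (joint min-min).
theta_hat = minimize_scalar(inner_dual, bounds=(-3.0, 3.0), method="bounded").x
print(theta_hat)    # close to 1, the mean of the data-generating distribution
```

By strong duality (Proposition 3), the inner dual value coincides with the inner maximum of (6) up to the $1/n_q$ scaling, so the outer minimization over $\theta$ targets the same estimator.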
6 Related Works

Score Matching (SM) [13, 14] is an inference method for unnormalized statistical models. It estimates model parameters by minimizing the Fisher divergence [20, 26] between the data density and the model density, which is specified through the gradient of its log density. To estimate the parameter, this method only requires $\nabla_x\log p(x;\theta)$ and $\nabla_x^2\log p(x;\theta)$, avoiding the evaluation of the normalization term.

Kernel Stein Discrepancy (KSD) [2] is a kernel mean discrepancy measure between a data distribution and a model density using the Stein identity defined with the Stein operator $T'_{p_\theta}$. This measure has been used for model evaluation [4, 19]. In Section 7, we minimize such a discrepancy with respect to $\theta$ for unnormalized model parameter estimation. A more generic version of this estimator has been discussed in [2].

Noise Contrastive Estimation (NCE) [10] estimates the parameters of an unnormalized statistical model by performing a non-linear logistic regression to discriminate between the observed dataset and artificially generated noise. The normalization term can be treated as a regular parameter in the statistical model and estimated through such a logistic regression. NCE requires us to select a noise distribution; in our experiments, we use a multivariate Gaussian distribution whose mean and variance match those of $X_q$.

7 Experiments

7.1 Validation of Asymptotic Results

To examine the asymptotic distribution of $\sqrt{n_q}(\hat\theta-\theta^*)$, we design an intractable exponential family model $p(x;\theta) := \exp(\eta(\theta)^\top\psi(x))$, where
$$\psi(x) := \Big[\textstyle\sum_{i=1}^5 x_i^2,\ x_1x_2,\ \sum_{i=3}^5 x_1x_i,\ \tanh(x)^\top\Big]^\top,\qquad \eta(\theta) := [-.5,\ .6,\ .2,\ 0,\ 0,\ 0,\ \theta^\top]^\top,\qquad x\in\mathbb{R}^5,\ \theta\in\mathbb{R}^2,$$
and $\tanh(x)$ is applied in an element-wise fashion. The feature function of the Stein feature is chosen as $f(x) := \tanh(x)\in\mathbb{R}^5$. Due to the tanh function, $p(x;\theta)$ does not have a closed-form normalization term. We draw $n_q = 500$ samples from $p(x;0)$ as $X_q$. Since we set $\theta^* = 0$, $X_q$ actually comes from a tractable distribution; however, the intractability of $p(x;\theta)$ does not allow us to perform MLE straight away. We run DLE 6000 times, with a new batch of $X_q$ each time, and obtain an empirical distribution of $\sqrt{n_q}\,\hat\theta$. We show qqplots of its marginal distributions vs. $\mathcal{N}(0, V_{1,1})$ and $\mathcal{N}(0, V_{2,2})$, the asymptotic distributions predicted by Theorem 2, whose variance $V$ is approximated using $X_q$ and $\hat\theta$.

[Figure 1: Theoretical predictions vs. empirical results. (a) qqplot of marginals; (b) $\sqrt{n_q}\,\hat\theta$ vs. confidence intervals; (c) Var($\hat\theta$), Gamma dist.; (d) Var($\hat\theta$), Gaussian mixture.]

Figure 1a shows that all quantiles of the empirical marginals and the predicted marginals are very well aligned. We also scatter-plot $\sqrt{n_q}\,\hat\theta$ together with the predicted 95% and 99.9% confidence intervals in Figure 1b. It can be seen that the empirical joint distribution of $\sqrt{n_q}\,\hat\theta$ has the same elongated shape as predicted by Theorem 2 and agrees with the predicted confidence intervals nicely.

One of our major contributions is proving that DLE attains the Cramér-Rao bound. We now compare the variances of the estimated parameter $\hat\theta$ for a Gamma model $p(x;\theta) = \Gamma(5,\theta)$, $\theta^* = 1$, and a Gaussian mixture model $p(x;\theta) = .5\,\mathcal{N}(\theta,1) + .5\,\mathcal{N}(1,1)$, $\theta^* = 1$, across DLE, SM and KSD. Var($\hat\theta$) is shown in Figures 1c and 1d. For DLE, we set $f(x) := [x, x^2]^\top$ and for KSD, we use a polynomial kernel of degree 2. Note that we deliberately choose $p(x;\theta)$ to be tractable so that we can compute MLE and the Cramér-Rao bound easily. It can be seen that all estimators have decreasing variances, and MLE, being one of the minimum variance estimators, has the lowest variance. However, DLE has the second lowest variance in both cases and quickly converges to the Cramér-Rao bound after $n_q = 150$. In comparison, both KSD and SM maintain higher levels of variance.

7.2 Unnormalized Model Using a Pre-trained Deep Neural Network (DNN)

In this experiment, we create an exponential family model $p(x;\theta_i) := \exp[\theta_i^\top\psi(x)]$, $x\in\mathbb{R}^{784}$, where $\psi(x)\in\mathbb{R}^{20}$ is a pre-trained 3-layer DNN. $\psi(x)$ is trained using a logistic regression so that the classification error is minimized on the full MNIST dataset over all digits. Clearly, $p(x;\theta_i)$ does not have a tractable normalization term. The dataset $X_{q_i}$ contains $n_q = 100$ randomly selected images from a single digit $i$, and we use DLE and NCE to estimate $\hat\theta_i$ for each digit $i$. For DLE, we set $f(x) = \psi(x)$. Though we can only obtain an unnormalized density for each digit, it can be used to rank images and find potential inliers and outliers.

[Figure 2: MNIST images with the highest (upper red box) and the lowest (lower green box) unnormalized density estimated on each digit by DLE and NCE.]

In Figure 2 we show the images that are ranked either among the top two or the bottom two places when sorted by $\log p(x;\hat\theta_i)$, for each digit $i$. It can be seen that, when $\hat\theta$ is estimated by DLE, the highest ranked images are indeed typical-looking images, while the lowest ranking images tend to be outliers in that digit group. In comparison, when $\hat\theta$ is estimated by NCE, some of the highest ranked images are distorted while some of the lowest ranked images look very regular. This experiment shows the usefulness of DLE as a model inference method when working with a complex model (a DNN) on a high-dimensional dataset ($d = 784$) using a relatively small number of samples ($n_q = 100$).
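As a sketch of the ranking step in this experiment (our own illustration; `psi` stands for any fixed feature extractor such as a pre-trained DNN, and `theta_hat` for a fitted parameter, both hypothetical here), images are simply sorted by their unnormalized log density $\hat\theta^\top\psi(x)$:

```python
import numpy as np

def rank_by_unnormalized_density(X, psi, theta_hat, k=2):
    """Return indices of the k highest- and k lowest-scoring points,
    where the score is the unnormalized log density theta_hat^T psi(x)."""
    scores = psi(X) @ theta_hat           # shape (n,)
    order = np.argsort(scores)            # ascending
    return order[-k:][::-1], order[:k]    # (top-k indices, bottom-k indices)

# Usage with hypothetical inputs:
# top, bottom = rank_by_unnormalized_density(X_digit_i, psi=dnn_features,
#                                            theta_hat=theta_i_hat)
```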
8 Conclusion and Discussion

In this paper, we introduce a model inference method for unnormalized statistical models. First, Stein density ratio estimation is used to fit a density ratio and to approximate the KL divergence. The model inference is then done by minimizing this approximated KL divergence. Despite promising theoretical and experimental results, future work is needed to develop a systematic way of choosing Stein features in different scenarios, as the performance of DLE depends heavily on such choices.

Acknowledgements

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. TK was partially supported by JSPS KAKENHI Grant Numbers 15H01678, 15H03636, 16K00044, and 19H04071. The authors would like to thank Dr. Carl Henrik Ek and three anonymous reviewers for their insightful comments.

References

[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723, 1974.
[2] A. Barp, F.-X. Briol, A. Duncan, M. Girolami, and L. Mackey. Minimum Stein discrepancy estimators. arXiv:1906.08283, 2019.
[3] W. Y. Chen, L. Mackey, J. Gorham, F.-X. Briol, and C. Oates. Stein points. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 844-853, 2018.
[4] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2606-2615, 2016.
[5] H. Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.
[6] J. Ding and A. Zhou. Eigenvalues of rank-one updated matrices with some applications. Applied Mathematics Letters, 20(12):1223-1226, 2007.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672-2680, 2014.
[8] J. Gorham and L. Mackey. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems 28, pages 226-234, 2015.
[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.
[10] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297-304, 2010.
[11] J. Hayakawa and A. Takemura. Estimation of exponential-polynomial distribution by holonomic gradient descent. Communications in Statistics - Theory and Methods, 45(23):6860-6882, 2016.
[12] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[13] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695-709, 2005.
[14] A. Hyvärinen. Some extensions of score matching. Computational Statistics & Data Analysis, 51(5):2499-2512, 2007.
[15] W. Jitkrittum, W. Xu, Z. Szabó, K. Fukumizu, and A. Gretton. A linear-time kernel goodness-of-fit test. In Advances in Neural Information Processing Systems 30, pages 261-270, 2017.
[16] Y. Li and R. E. Turner. Gradient estimators for implicit models. In International Conference on Learning Representations, 2018.
[17] L. Lin, M. Drton, and A. Shojaie. Estimation of high-dimensional graphical models using regularized score matching. Electronic Journal of Statistics, 10(1):806, 2016.
[18] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems 29, pages 2378-2386, 2016.
[19] Q. Liu, J. D. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, pages 276-284, 2016.
[20] S. Lyu. Interpretation and generalization of score matching. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 359-366. AUAI Press, 2009.
[21] J. K. Merikoski and R. Kumar. Inequalities for spreads of matrix sums and products. Applied Mathematics E-Notes [electronic only], 4:150-159, 2004.
[22] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
[23] C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695-718, 2017.
[24] C. R. Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81-91, 1945.
[25] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, 2005.
[26] P. Sánchez-Moreno, A. Zarzo, and J. S. Dehesa. Jensen divergence based on Fisher's information. Journal of Physics A: Mathematical and Theoretical, 45(12), 2012.
[27] J. Shi, S. Sun, and J. Zhu. A spectral approach to gradient estimation for implicit distributions. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4644-4653, 2018.
[28] B. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830-1888, 2017.
[29] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, pages 583-602. University of California Press, 1972.
[30] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, pages 1433-1440, 2008.
[31] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
[32] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
[33] D. Wang and Q. Liu. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
[34] H. Yanai, K. Takeuchi, and Y. Takane. Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition. Springer-Verlag New York, 2011.
[35] S. Yu, M. Drton, and A. Shojaie. Generalized score matching for non-negative data. arXiv preprint arXiv:1812.10551, 2018.