# proper_losses_for_discrete_generative_models__b9499740.pdf

Proper Losses for Discrete Generative Models

Dhamma Kimpara 1 Rafael Frongillo 1 * Bo Waggoner 1 *

We initiate the study of proper losses for evaluating generative models in the discrete setting. Unlike traditional proper losses, we treat both the generative model and the target distribution as black-boxes, only assuming ability to draw i.i.d. samples. We define a loss to be black-box proper if the generative distribution that minimizes expected loss is equal to the target distribution. Using techniques from statistical estimation theory, we give a general construction and characterization of black-box proper losses: they must take a polynomial form, and the number of draws from the model and target distribution must exceed the degree of the polynomial. The characterization rules out a loss whose expectation is the crossentropy between the target distribution and the model. By extending the construction to arbitrary sampling schemes such as Poisson sampling, however, we show that one can construct such a loss.

1. Introduction

Generative models are widely used tools in machine learning and statistics. For example, Generative Adversarial Networks (GANs) have recently been successful particularly in natural language and image generation. However, the evaluation of generative models is still an open area of research, with many evaluation methods proposed (Borji, 2019; Theis et al., 2015). This paper investigates theoretical foundations for evaluating generative models using a proper losses approach.

Specifically, we consider evaluating generative models that aim to match some underlying target distribution. For example, a GAN s goal may be to produce sentences from the same distribution as a random sentence drawn from

*Equal contribution 1University of Colorado Boulder. Correspondence to: Dhamma Kimpara <dhamma.kimpara@colorado.edu>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Wikipedia; or to produce an image of a human face drawn from the same distribution as all U.S. passport photos. In areas such as climate modeling or weather forecasting, the goal may be to produce possible future trajectories from the same distribution as the actual climate. We abstract away from how the model is trained and learned; our focus is only on methods of evaluating the model. We leave the issue of training for future investigation.

We take the approach of the proper losses and proper scoring rule literature (Mc Carthy, 1956; Savage, 1971; Gneiting & Raftery, 2007), using one or more observations drawn from the target distribution to evaluate the model. However, many generative models are essentially black boxes . One typically cannot obtain a closed form expression for the probabilities a model assigns to different outputs. This rules out using traditional proper losses for evaluating distributions, such as ℓ2 loss or log loss. As a theoretical foundation, we instead assume only that we can draw independent and identically-distributed (i.i.d.) observations from the model p and compare these to observations from the target distribution q. The question is whether, and/or how, one can design losses under these restrictions that are proper: the expected loss is minimized by setting the model s distribution equal to the target, i.e. setting p = q.

Our results. As the initial work taking this approach, we focus on distributions over discrete, usually finite, sample spaces. We discuss extensions to the continuous setting in Section 7. First, we consider an easier problem: If we had full access to the target distribution q, i.e. in closed form or as an oracle, can we design proper losses for evaluating the model p from samples? We call this the report-black-box (RBB) setting. We show that the naive approach of plugging the empirical distributions directly into a distance function such as ℓ2 does not yield a proper loss. However, by using the samples to construct unbiased estimators of the error introduced, we can correct for them and produce losses that are in fact proper.

Extending the unbiased-estimator approach, we characterize RBB-proper losses as those whose expectation is a polynomial in the model distribution, e.g. expected loss p q k k for even integers k. For such polynomials, we explicitly construct RBB-proper losses using the classical theory of unbiased estimators. Furthermore, the minimum number of

Proper Losses for Discrete Generative Models

observations that must be drawn from the model is exactly the degree of the polynomial. On the other hand, the characterization implies impossibility results for many popular forms of distances, including the cross entropy (expected log loss).

Second, we consider the full problem: what if we only have sample access to the target distribution q as well as the model p? Leveraging the above results, we give a similar characterization and construction for black-box (BB) proper losses. Again, the degree of the polynomial in p (respectively, q) governs the the size of the necessary and sufficient sample that must be drawn.

Generalizing, we consider more general sampling schemes that do not draw a predetermined number of observations. In particular, using Poisson sampling, we are able to overcome the above impossibility result and construct the first blackbox proper loss that in expectation equals the cross entropy between p and q.

Finally, we experimentally evaluate our losses as a proof of concept.

1.1. Related Work

Our approach is based on the axiomatic approach in the proper scoring rule and proper losses literature, e.g. (Gneiting & Raftery, 2007). Most similar to our work in this tradition is Haghtalab et al. (2019), which examined convergence rates of the log loss for distribution learning in a setting similar to our simplified setting. Our characterizations will cover these proper scores as a special case, along with multi-observation losses that elicit a distribution (Casalaina-Martin et al., 2017).

There are many losses used in evaluation and training of GANs and other Neural Network (NN) based generative models (Borji, 2019; Theis et al., 2015). In adversarial training, much attention is given to obtaining unbiased gradients. These training losses cannot be translated into a proper loss because the loss is in a variational form that is inherent to the adversarial training method (Goodfellow et al., 2014; Bi nkowski et al., 2018). However, the energy distance, a special case of the Maximum Mean Discrepancy (MMD) metric, has been used in its closed form to directly train NN based generative models (Dziugaite et al., 2015; Li et al., 2015; Sz ekely & Rizzo, 2005; Bi nkowski et al., 2018). The MMD in general is typically only available in a variational form and thus is not proper in practice. However, the energy distance actually can be used to construct a loss satisfying our definition of black-box proper. So it can be viewed as a pre-existing proof of concept for the ideas formalized and generalized in this paper. See Appendix G for further discussion.

In distribution learning (Han et al., 2020) and classical

machine learning (Nguyen et al., 2010; Gy orfi & Van der Meulen, 1987; Hall & Morton, 1993; Joe, 1989), there is a line of work devoted to estimating divergences between pairs of distributions. While these literatures provide convergence and consistency results, the estimators and distances generally do not result in proper losses.

2. Background

For this work N = {0, 1, 2, 3, . . . }. We primarily work with distributions over a finite domain X. The set of all probability distributions over X is denoted by X . We denote a distribution by a vector of probabilities p X RX , where px is the probability p places on x X. We use δx X to denote an indicator vector, i.e., the distribution placing probability one on x. Norms without a subscript are 2-norms: = 2.

In our setting, there is target distribution q X . We will generally use Y to denote observations drawn from q. We aim to evaluate a model that we will represent as p X , also a distribution. We will generally use X to denote observations drawn from p. Uppercase letters generally refer to random variables while lowercase letters are realizations, e.g. X = x.

We will also use various unbiased estimators from classical statistical estimation theory (see Appendix A). A function f is an unbiased estimator for a parameter θ of a family of distributions {Fθ} if, for any θ and any random variable Z Fθ, we have E f(Z) = θ. Unless otherwise specified, we will always use the minimum variance unbiased estimator (MVUE, see Appendix A).

We next recall the classical approach to evaluating p, which assumes full access to p in closed form. Then we introduce our setting, where we cannot access p except by drawing samples.

2.1. The classical approach: proper losses

We proceed with our theory via the perspective of proper losses. This literature was developed to elicit and evaluate general statistical reports or predictions from an agent. Introduced in (Brier, 1950), a proper loss (also historically termed a proper scoring rule) is a function r(p, y) that assigns a loss to a model or forecast p on an observation y, where y is drawn from the target q. As we will see, proper losses do not apply in our setting because they assume ability to query the value of px on any x. Nevertheless, they are a useful starting point. Definition 2.1. A loss function r : X X R is proper if for all p, q X , E y q r(q, y) E y q r(p, y). A loss is

strictly proper if the above inequality is strict for all p = q.

In other words, for any fixed target distribution q, the op-

Proper Losses for Discrete Generative Models

timal model p (i.e. the one that minimizes expected loss) is p = q. A classic result fully characterizes all proper losses via Bregman divergences, which can be used as measures of distance between two distributions. For reference, we define Bregman divergences and recall the scoring rule characterization in Appendix D.

The two most common proper losses are the squared loss r(p, y) = p δy 2 2, where δy is the indicator vector on y; and the log loss r(p, y) = log py, whose expectation is the cross-entropy ℓ(p, q) = P

X qx ln px.

3. Report Black Box Proper

To develop our results, in this section we consider a simplified setting where we have full access to the target distribution q. We aim to evaluate the model p based only on i.i.d. observations drawn from it. In later sections, we will assume only sample access to q as well.

3.1. Basic definitions

To evaluate p, we will draw i.i.d. observations from p. Formally, we draw a sample (X1, . . . , Xn) where the Xi are independent random variables taking values in X, each distributed according to p. It will be convenient to represent the sample as a histogram H NX , where Hx = |{i : Xi = x}|. It is without loss of generality to consider loss functions that take H as input rather than the individual samples.1

We use Hn to denote the set of histograms arising from n samples, i.e. Hn = {h NX : h 1 = n}. We write H pn to denote that the random histogram H Hn is distributed according to pn, the distribution over all samples of size n drawn i.i.d. from p. Given a histogram h Hn, the empirical distribution is ˆp = 1

Definition 3.1. A report-black-box (RBB) loss is a function L : Hn X R. Here L(h, q) is the loss assigned to a histogram h of n samples drawn from the model when the target distribution is q.

Definition 3.2. For a RBB loss L : Hn X R, the associated expected loss is L(p, q) = E H pn L(H, q).

The key property we want our loss functions to satisfy is properness, i.e., that expected loss is minimized by setting the model p equal to the target q. Therefore, the following definition becomes useful:

Definition 3.3. A function ℓ: X X R is called a

1By exchangeability of i.i.d. samples, any function f(S) of the sample S = (X1, . . . , Xn) can be simulated by a function g(H) of the histogram, where g simply arranges the samples that make up H in a uniformly random order to obtain S and applies f(S ). Then g(H) has the same distribution as f(S), because S has the same distribution as S.

proper divergence if for all fixed q,

ℓ(q, q) ℓ(p, q) ( p).

It is called a strictly proper divergence if the above inequality is strict for all p = q.

Examples of proper divergences are the squared distance p q 2 and the cross-entropy E Y q log p Y . A proper di-

vergence ℓrepresents our goal: we would like to use such a divergence to evaluate p. In general, we cannot use ℓ directly, because evaluating the divergence requires access to the closed form of p, and we can only draw observations from p. However, we can implement a divergence ℓif we can construct a RBB loss L whose expectation is ℓ. As such, the following captures what it means for L to be proper in our setting.

Definition 3.4. A report-black-box loss function L is reportblack-box proper (RBB proper) if L(p, q) is a proper divergence. If ℓis some proper divergence and there exists L such that L = ℓ, we will say that L implements ℓand that ℓ is implementable.

3.2. Proof of concept: squared loss

Is there any proper divergence that is implementable? A priori, it might seem that given a loss L : Hn X R, there is always a way to tweak a misreport p to put higher weight on certain points and improve the expected loss.

Let us begin by investigating the ℓ2 divergence ℓ(p, q) = p q 2 2. In the traditional proper loss (or proper scoring rule) setting, this yields a proper loss r(p, y) = p δy 2 2. Can we utilize squared loss as a RBB proper loss function by simply replacing p with ˆp? In fact, no:

Claim 1. The loss L(h, q) = ˆp q 2 2, where ˆp = 1

nh is the empirical distribution, is not RBB proper for any sample size n.

Sketch. A straightforward calculation, using p = E ˆp, shows that E ˆp q 2 = p q 2 + X

x X Var(ˆpx). In

general, this is not minimized by p = q; for example, with a 0.1-weighted coin, the optimal model p is always a coin with weight strictly less than 0.1 (notice this decreases the variance of ˆp).

In summary, the expected loss of this naive approach is the proper divergence p q 2 2 plus an extra term. However, the key insight is that the extra term can be estimated unbiasedly from a finite number of observations. Let n 2 and let s2 n(α) = 1 n 1 α(1 α)2 + (1 α)α2 . Then (Claim A.2.1) s2 n is an unbaised estimator for Var(ˆpx), that is, E ˆp pn s2 n(ˆpx) = Var(ˆpx). This proves the following.

Proper Losses for Discrete Generative Models

Claim 2. The ℓ2 divergence ℓ(p, q) = p q 2 is implementable. In particular, for any n 2, the following loss is RBB proper and satisfies L(p, q) = p q 2 (here ˆp = 1

Ln(h, q) = ˆp q 2 X

X s2 n(ˆpx),

We discuss the reason underlying the variance term and generalize this construction to other divergences in Appendix D. A similar proof of concept can arise from considering the energy distance in continuous space, as discussed in Section 1.1 and Appendix G.

3.3. Minimum number of draws required

Now that we know it is possible to implement at least some proper divergences, a natural question is how many observations one needs to draw from p in order to do so. In cases where a generative model is expensive to sample, we might prefer to use RBB proper losses that can utilize a smaller sample size. To do so, we define the notion of a tight lower bound on the observations needed to implement a proper divergence.

Definition 3.5. Let n N. A proper divergence, ℓ, is nminimally-implementable if for all n n, there exists a RBB loss L : Hn X R that implements ℓand, for all k < n, there does not exist a RBB loss L : Hk X R that implements ℓ.

3.4. Characterization of Discrete Losses

We have seen that naively applying a proper divergence as a loss function introduces an extra penalty term, which can be corrected if we can unbiasedly estimate the penalty from samples. To make this approach fully general, we turn to the theory of U-estimation, which defines unbiased estimators. The key idea is that histogram H pn has a multinomial distribution. There are classical results (Lemma A.3.1) describing which functions of multinomials have unbiased estimators. We utilize these results to characterize the proper divergences that are n-implementable. Also, we characterize the minimal-implementability of every such implementable divergence. We first recall the definition of a polynomial function of a vector.

Definition 3.6. A function f : X R is a polynomial if it is of the form

x X pjk,x x ,

where the sum is over a finite index set K, where each jk NX is unique, and where each ak is a nonzero real number. In this case, the degree of f is maxk K jk 1, i.e. the largest sum of exponents of any monomial. We say a function is a polynomial in its jth argument of degree n

if, for all fixed values of the other arguments, the induced function of the jth argument alone is a polynomial, and there exists a maximum degree n of any such induced polynomial.

Theorem 1. Let ℓ(p, q) be a proper divergence. Then ℓis implementable if and only if it is a polynomial in its first argument. Furthermore, if ℓis implementable, then ℓis n-minimally implementable where n is the degree of the polynomial.

Given a sample-size budget of n, Theorem 1 tells us which proper divergences can be implemented in evaluating a black-box model. Furthermore, the proof will actually construct a loss that minimally-implements the proper divergence.

Proof. Let ℓbe a proper divergence that is a polynomial in its first argument, in particular, of degree n. We show ℓis implementable using sample size n. Write ℓin the form of Definition 3.6, i.e. for each fixed q,

ℓ(p, q) = X

k K(q) a(q) k Y

x X p j(q) k,x x ,

where K(q) is finite, each a(q) k is a nonzero constant, and each j(q) k 1 n. By classical results (Lemma A.3.1), any given monomial in p of degree at most n has an unbiased estimator using n samples from p. In particular, the minimum-variance unbiased estimator (MVUE) of the

X p j(q) k,x x is:

and satisfies E H pn

h tn,j(q) k (H) i = Y

X p j(q) k,x x (Lemma A.3.1).

Therefore, the loss L(h, q) = P

k a(q) k tn,j(q) k (h) satisfies L = ℓ, and it implements ℓ.

Now suppose ℓis not a polynomial of degree at most n in its first argument. That is, there exists q such that ℓ(p, q) either has higher degree or is not a polynomial at all. The characterization of the U-estimable functions under the multinomial distribution, Lemma A.3.1, directly implies there does not exist an unbiased estimator for ℓ( , q) using sample size n, i.e. there does not exist L : Hn X R such that E H pn L(H, q) = ℓ(p, q). This shows that non-

polynomials are not implementable; and that polynomials of degree n > n are not implementable with only n observations. For the other part of minimally-implementable, our construction above implies that for all n deg(ℓ) there exists a loss L : Hn X R that implements ℓ.

Corollary 1. Let ℓbe a polynomial divergence as defined in definition 3.6. If L is constructed as according to Theorem 1 to implement ℓ, then L can be computed in O(P

k K jk 1) = O(|K|deg(ℓ)) time.

Proper Losses for Discrete Generative Models

We immediately obtain some positive examples, such as:

Corollary 2. For any even integer k 2, the proper divergence p q k k = P

x(px qx)k is implementable, and in particular, is k-minimally implementable.

However, we also obtain impossibility results:

Corollary 3. The cross-entropy P

x qx log px is not implementable for any finite sample size.

In Section 5, we will return to this example and show that cross-entropy actually can be implemented with a more creative approach to sampling.

3.5. Linearly Decomposable Losses

We now examine a special case of the previous characterization that includes many popular distance metrics. In our motivating example we found a report-black-box proper loss that in expectation is the the squared loss. It turns out to be the sum over all X of a coordinate-wise loss. We now leverage this and construct losses that implement certain distances that are linearly decomposable. We extend these losses to handle the case when X is countably infinite in Appendix C.

Corollary 4. A linearly decomposable proper divergence, ℓ(p, q) = P

X d(px, qx), is n-minimally-implementable if and only if d( , ) is a polynomial with degree equal to n in the first argument.

4. Black Box Properness

The RBB setting, while an important step, is not the most common in evaluating generative models. In this section, the fully black-box setting, we must evaluate only with samples from both the candidate model and the target distribution. We extend our definitions to encompass this setting. The RBB setting will be a special case of this more general setting.

Definition 4.1. A black-box (BB) loss is a function L : Hn Hm R where L(hp, hq) is the loss assigned to histogram hp of n samples drawn from the model on histogram hq of m samples drawn from the target distribution.

Definition 4.2. For a black-box loss L : Hn Hm R, the associated expected loss is L(p, q) = E Hp pn

Hq qm L(Hp, Hq).

Definition 4.3. A black-box loss function L is black-box proper (BB proper) if L is a proper divergence ℓ. If ℓis some proper divergence and there exists L such that L = ℓ, we will say that L implements ℓand that ℓis implementable.

We again define the notion of minimal-implementability. In cases where the target distribution is difficult to sample, we might prefer to use BB proper losses that can utilize a

smaller target sample size. For example, generative models for forecasting e.g. climate may only have access to one observation from q, i.e. the weather that actually occurs on a given day. On the other hand, other settings may present other tradeoffs between model and target sample size.

Definition 4.4. A proper divergence, ℓ, is (n , m )- minimally-implementable if for all n n and m m

there exists a BB loss L : Hn Hm R that implements ℓand for all (k, j) where k < n or j < m , there does not exist a loss L : Hk Hj R that implements ℓ.

4.1. Proof of concept: squared loss

We provide an illustrative example for the ℓ2 proper divergence by extending the techniques we developed in Theorem 1. Again, the key idea is an unbiased estimator, namely δBin j,k (t) = t(t 1) (t k+1) j(j 1) (j k+1). By Lemma A.4.1, if

T Binom(j, α) and j k, then E δBin j,k (T) = αk. The point is that, for any x, the number of observations Hp x is distributed Binomially, as is Hq x, and they are independent.

Claim 3. For distributions over a finite domain X, the squared loss q p 2 is implementable. In particular, for any n 2 and m 2, it is implemented by

Ln,m(Hp, Hq) = X

" Hp x(Hp x 1) n(n 1) 2Hp x Hq x nm

+ Hq x(Hq x 1) m(m 1)

We observe that, although Ln,m contains a sum over all X, only at most n + m terms will be nonzero, so Ln,m is efficient to implement regardless of the size of the domain X

Proof. Observe that Ln,m(Hp, Hq) = P

X δBin n,2 (Hp x) 2δBin n,1 (Hp x)δBin m,1 (Hq x) + δBin m,2 (Hq x) . Using that δBin n,i (Hp x) is an unbiased estimator for pi x, and symmetrically for δBin m,i (Hq x), along with independence of Hp and Hq, we immediately get E L(Hp, Hq) = X p2 x 2pxqx + q2 x = p q 2.

The fact that there exists any proper loss with only n = 2 observations from p and m = 2 observations from q is somewhat remarkable: however large the sample space X, for example all sentences up to a fixed length or all images of a certain number of pixels, merely 4 total observations suffice to incent the learner to exactly set the model p to match the target q. In fact, slightly better is possible: the Brier score, i.e. the proper divergence P

X p2 x 2pxqx, is (2, 1)-minimally-implementable, as our next result will imply.

Proper Losses for Discrete Generative Models

4.2. Characterization of Discrete Losses

As in the RBB setting, we utilize the theory of U-estimation to characterize the proper divergences that are implementable by BB losses. The proof follows similarly because Hp and Hq are independent random variables, so the RBB analysis above can essentially apply to each separately. The proof appears in Appendix B.

Theorem 2. Let T be the set of all proper divergences. Let Fn be the set of all polynomials in the first argument with degree n and Fm be the set of all polynomials in the second argument with degree m. The set of all (n, m)- implementable proper divergences is

BBn,m = T Fn Fm.

Furthermore, a proper divergence ℓis (j, k)-minimallyimplementable if and only if it has degree in the first argument equal to j and degree in the second argument equal to k.

4.3. Consequences and connections to proper scoring rules

There are a number of consequences and special cases of note. One class of special cases is m = , which we use to denote the case where we have full access to the target q in closed form. Then we obtain BBn, , which reduces to the report-black-box (RBB) setting. Similarly, n = denotes the case where we have full access to the model p in closed form, which reduces to the traditional proper loss setting. In particular, BB ,1 is the set of proper losses. Furthermore, by the same reasoning as in Theorem 2, we know that any ℓ BB ,1 must be linear in the second argument and must also be a proper divergence. Hence one could follow this reasoning as an alternative approach to characterizing all proper scoring rules (Theorem D.0.1).

Corollary 5. Let ϕ(BB ,1) be all the proper scoring rules (in the form of losses). Then

L ϕ(BB ,1) { L : L is a polynomial in the

first argument with degree n}.

Corollary 5 is relevant to fields using generative models to forecast events, such as in weather or climate forecasting. In these cases, we may be able to draw n i.i.d. observations from the learner s model p, but only m = 1 observation from nature, i.e. the weather that actually occurs. In such cases, Corollary 5 implies that we can construct a BB proper loss from any proper scoring rule that is a polynomial in p.

Corollary 6. For n, m N, BBn,m BBn, BB ,m. In other words, if a divergence is (n, m)-BB implementable then it is n-RBB implementable and implementable via a multi-observation proper loss with m observations.

Corollary 7. For n, m N, BBn,m BBn+1,m BBn,m+1 BBn+1,m+1.

Corollary 8. If T in Theorem 2 is the set of all strictly proper divergences, then BB1,m = .

5. Poisson Sampling

So far our results show that proper divergences must all be polynomial in the distributions in order to be implementable. As such, cross entropy cannot be (n, m)-implemented for any finite n, m N. We now show that cross entropy can be implemented if we generalize to other sampling schemes. A sampling scheme is a (possibly randomized) stopping rule determining the number of samples to draw from a generative model. In Appendix E, we formally define and fully characterize implementable proper divergences under arbitrary sampling schemes. Here, we focus on the example of Poisson sampling specifically for the cross-entropy divergence. Other sampling schemes admit a multitude of other distinct classes of U-estimable functions.

We will determine the implementable proper divergences under Poisson sampling schemes. Poisson sampling gives us much more powerful estimators than in the scheme where we draw a deterministic sample size. The Poisson distribution is a discrete probability distribution over N with parameter θ > 0 and probability mass function f(j; θ) = Pr[T = j] = θje θ

The sampling scheme is as follows. Let α, β > 0. First randomly draw the sample sizes N Poi(α) and M Poi(β). Then draw N observations from p and M observations from q. Poisson sampling gives us two powerful properties. First, the counts of each outcome, hp x (resp. hq x), are independent and distributed according to Poi(αpx) (resp. Poi(βqx)). Second, we are able to unbiasedly estimate θk for any k N and thus can unbiasedly estimate any power series involving θ. This estimation is achieved by the Poisson estimator:

δP oi k (t) = t(t 1) (t k + 1).

By Lemma A.5.1, if T Poi(θ) then E δP oi k (T) = θk

for any k N. We will use this estimator extensively in this section. The first result immediately follows from this estimator. The second follows from the characterization of U-estimable functions in Lemma A.5.1.

Corollary 9. For any n, m N and α, β > 0, BBn,m BBP oi(α),P oi(β), the set of all implementable functions with Poisson sampling from p and q.

Corollary 10. A proper divergence is (Poi(α), Poi(β))- implementable for any α, β > 0 if and only if it has an equivalent power series expression in the first and second arguments with non-negative integer powers and the power series satisfies 1) every coefficient of the first and second

Proper Losses for Discrete Generative Models

arguments is finite and 2) if the series diverges for any argument, the proper divergence also diverges in the same direction (goes to + or ).

The proof of Corollary 10 appears in Appendix B. Crucially for implementing the cross entropy, many functions in C

have an equivalent (Taylor) series that satisfy the conditions of corollary 10. We will use the Taylor series for ln(x) in the next section.

5.1. Cross Entropy

As a consequence of Corollary 10, we can construct a generic-black-box proper loss that implements the cross entropy. By similar methods we can also implement the Shannon entropy and the KL divergence. Note that with deterministic sampling, we cannot construct such a loss.

Lemma 1. Let hp x = P

y =x hp y be number of occurences of all the outcomes except x. Then for any α, β > 0 the loss,

L(hp, hq) = X

1 β δP oi 1 (hq x)

1 k 1 αk δP oi k (hp x),

(Poi(α), Poi(β))-implements the cross entropy, ℓ(p, q) = P

X qx ln(px).

We observe that this loss is always finite for a finite sample because δP oi k (t) = 0 when t < k. In fact, the loss is efficient to evaluate, i.e. polynomial time in terms of the number of samples drawn, as that number governs the number of nonzero terms. Memoization of the infinite sums and the summands in the infinite sum provides the most efficient way to compute this loss. Furthermore, the computation is highly parallelizable.

Corollary 11. Let L be the loss that implements cross entropy in Lemma 1, N Poi(α), M Poi(β), and c be the number of unique non-zero integers in {hp x}x X . Then L can be computed in O(|X| + c N) = O(|X| + N 1.5) time. If the histogram counts are stored in a dictionary-like data structure, and an element only has an entry if it was observed, then the amortized runtime is O(min(|X|, β) + α1.5).

In correspondence with the cross entropy, this loss can equal infinity in expectation for certain p, q, although it is finite for every hp, hq. We note that the cross entropy can also be (Poi(α), m)-implemented, with any m 1, with the loss L(hp, hq) = P

X ˆqx P k=1 1 k 1 αk δP oi k (hp x), where ˆq = 1

Proof. We will use the Taylor expansion for ln(t). For t [0, 1], ln(t) = P k=1 1 k(1 t)k. Note that the series diverges to at t = 0 but also limt 0 ln(t) = .

Next, we will use that Hp and Hq are independent; that Hq x is distributed Poi(βqx); and that Hp x is distributed Poi(α(1 px)).

Hq qm L(Hp, Hq) =

1 β δP oi 1 (Hq x)

1 k 1 αk δP oi k (Hp x)

1 k (1 px)k

X qx ln(px),

including the case where both the proper divergence and the expected value of the BB loss equals (i.e. there exists x with qx > 0 and px = 0).

We note that the KL-divergence, ℓ(p, q) = P

px , can be implemented as well. In fact, it equals the crossentropy plus the Shannon entropy of q. Shannon entropy can be estimated unbiasedly with Poisson sampling because Hq x and Hq x are distributed as independent Poissons, so

1 k 1 βk δP oi k (Hq x) = qx

1 k (1 qx)k = qx ln(qx).

6. Experiments

For a proof of concept, we performed numerical experiments to evaluate our loss functions on a variety of pairs of distributions. We focused on the black-box setting, since this evaluation setting is more difficult than the report blackbox setting. For this section we define K := |X| and we call divergences distances.

We consider the task of distinguishing different power law distributions, which often arise in connection with natural language data. Results for other pairs and types of distributions appear in Appendix H.

For each pair of distributions p and q, at each number of total samples, we measured the absolute deviation between the loss value and the true distance between the distributions. We drew up to K1.5 total samples. We repeated this experiment for various batch sizes, where at each iteration, we drew the same batch size from p and q.

Of course, our losses work even with different batch sizes. For simplicity we kept the batch sizes the same.

We can discern from our experiments that given a budget of samples, the black-box loss is generally more accurate when all the samples are used in the computation of a single black-box loss value. This is opposed to splitting the sample

Proper Losses for Discrete Generative Models

(a) Squared Distance (b) Cross Entropy

Figure 1. The y-axis is the absolute deviation between the black-box loss value and the true distances (Squared distance and Cross Entropy). K = 10, 000, p and q are Zipfians with parameters 1 and 2, respectively. 30 trials for each parameter setting was recorded (including batch size). The horizontal bars represent the maximum absolute deviation of any of the 30 trials. The solid markers represent the average of the trials. Squared distance was estimated using the loss from claim 3. Cross entropy was estimated with a poisson sampling loss. Batch sizes were the same between draws from p and q.

into smaller batches and computing the average of the BB loss over all the batches. This suggests a theoretical result that it is always better to use the largest possible batch size. Note however that in the squared distance case, varying the batch size did not change the accuracy much. We also note that the figures in appendix H show that the Poisson estimator consistently under-estimates the true loss value.

We also exhibit excellent convergence. In both cases, at and past K log K total samples, the average value of our losses over all the trials are within 10% of the true distance between the distributions. The normalized plots showing multiplicative error are included in the appendix.

7. Discussion

Larger batches are better. As we saw in the experiments, if one has a budget of samples, it is best to use all those samples in a single instance of a BB proper loss function, rather than split those samples up into smaller batches and taking the average of the loss over all the batches. In general, a direct extension would be to analyze the variance and convergence rate of these BB losses with regards to the batch size.

Losses for continuous domains. We have focused on the discrete case in this work, leaving the continuous case to further investigation. However, we illustrate an initial result in the continuous setting. Let Fp( ) be the CDF of distribution p and FS(x) = |{i:Xi x}|

n be the empirical CDF based on sample S = (X1, . . . , Xn) where each Xi is drawn i.i.d. from p.

Theorem 3. Let X = R and αi R for all i. Let a proper divergence be of the form ℓ(p, q) = R

R g({Fp(x + αi)}m i=1, )dx. If g( , ) is a polynomial in the first argument with powers j(q) k Z|X| + such that j(q) k 1 = n, the number of samples, then g is n-minimally-implementable.

The proof appears in Appendix B. As a corollary, we are able to implement the Cram er distance which we exhibit in Appendix F. These types of distances can easily be extended to a high dimensional distance by picking a direction at random and defining the empirical CDFs based on the hyperplane defined by that random direction. We illustrate this via a high dimensional version of the Cram er distance in appendix F.

Future work. A direction of future work is that of constructing black-box proper losses for continuous settings, which is the most common use-case for GANs. Another important study would be to investigate the properness of existing losses used in evaluation. Finally, it would be interesting to investigate the use of BB proper losses in evaluating implicit distributions of black-boxes for desired properties. For example evaluating a dice for uniformity or evaluating prepared quantum states.

Broader Impacts

The evaluation of generative models, such as GANs, is a very open question with important societal impacts in domains such as climate forecasting. We provide an initial theoretical foundation for this question. Instead of direct applications, we anticipate this work leading to further theo-

Proper Losses for Discrete Generative Models

retical investigation. It may inform practitioners choices of which losses they use for evaluating generative models. Of course, such evaluation can be used for ethical or unethical purposes. We do not know of particular risks or negative impacts of this work beyond risks of generative models in general.

Acknowledgements

We thank Claire Monteleoni and Amit Rege for helpful discussions and references. This material is based upon work supported by the US National Science Foundation under Grant No. IIS-2045347.

Bi nkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying mmd gans. In International Conference on Learning Representations, 2018.

Borji, A. Pros and cons of gan evaluation measures. Computer Vision and Image Understanding, 179:41 65, 2019.

Brier, G. W. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1 3, 1950.

Casalaina-Martin, S., Frongillo, R., Morgan, T., and Waggoner, B. Multi-observation elicitation. In Conference on Learning Theory, pp. 449 464. PMLR, 2017.

Cram er, H. On the composition of elementary errors: First paper: Mathematical deductions. Scandinavian Actuarial Journal, 1928(1):13 74, 1928.

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. ar Xiv preprint ar Xiv:1505.03906, 2015.

Glasser, G. J. Minimum variance unbiased estimators for poisson probabilities. Technometrics, 4(3):409 418, 1962.

Gneiting, T. and Raftery, A. E. Strictly proper scoring rules, prediction, and estimation. Journal of the American statistical Association, 102(477):359 378, 2007.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

Gy orfi, L. and Van der Meulen, E. C. Density-free convergence properties of various estimators of entropy. Computational Statistics & Data Analysis, 5(4):425 436, 1987.

Haghtalab, N., Musco, C., and Waggoner, B. Toward a characterization of loss functions for distribution learning.

Advances in Neural Information Processing Systems, 32, 2019.

Hall, P. and Morton, S. C. On the estimation of entropy. Annals of the Institute of Statistical Mathematics, 45(1): 69 88, 1993.

Han, Y., Jiao, J., and Weissman, T. Minimax estimation of divergences between discrete distributions. IEEE Journal on Selected Areas in Information Theory, 1(3):814 823, 2020.

Hoeffding, W. Range preserving unbiased estimators in the multinomial case. In The Collected Works of Wassily Hoeffding, pp. 613 615. Springer, 1994.

Joe, H. Estimation of entropy and other functionals of a multivariate density. Annals of the Institute of Statistical Mathematics, 41(4):683 697, 1989.

Kolmogorov, A. N. Unbiased estimates. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 14(4):303 326, 1950.

Lehmann, E. L. and Casella, G. Theory of point estimation. Springer Science & Business Media, 2006.

Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In International conference on machine learning, pp. 1718 1727. PMLR, 2015.

Mc Carthy, J. Measures of the value of information. Proceedings of the National Academy of Sciences, 42(9):654 655, 1956.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847 5861, 2010.

Rockafellar, R. T. Convex Analysis, volume 36. Princeton University Press, 1970.

Savage, L. J. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783 801, 1971.

Sz ekely, G. J. and Rizzo, M. L. A new test for multivariate normality. Journal of Multivariate Analysis, 93(1):58 80, 2005.

Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. ar Xiv preprint ar Xiv:1511.01844, 2015.

Proper Losses for Discrete Generative Models

A. Results from Statistical Estimation Theory

We have extensively utilized results from Unbiased Estimation (U-Estimation) theory. These estimators are fundamental to our construction of proper losses for generative models.

A.1. Definitions

Since there are possibly infinitely many U-estimators for many quantities, the literature provides a criteria for the best estimator:

Definition A.1.1. Let Y1, Y2, . . . , Yn be i.i.d. from some member of a family of densities, pθ, θ Ω. An estimator δ is a Minimum Variance Unbiased Estimator (MVUE) for g(θ) if for some n N, for all θ Ω,

1. δ is an unbiased estimator for g(θ), E δ(Y1, Y2, . . . , Yn) = g(θ),

2. Var(δ(Y1, Y2, . . . , Yn)) Var( δ(Y1, Y2, . . . , Yn)) for any other unbiased estimator δ.

A.2. MVUE for Variance

Fact A.2.1. (Canonical MVUE for Variance) Let (yi)n i=1, n 2, be i.i.d. realizations of a random variable Y . Then the MVUE for variance is

s2 n((yi)n i=1) := 1 n 1

Claim A.2.1. If Yi iid Ber(α) and Z = Y1 + Y2 + + Ym then Z Bin(m, α). Let Y = Z/m = 1

i Yi. Then for m 2,

s2 m( Y ) = 1 m 1[ Y (1 Y )2 + (1 Y )( Y )2]

is a MVUE for variance. Note that Var( Y ) = α(1 α)

E Z=m Y s2 m( Y ) = V ar( Y )

= 1 m2 V ar(

i=1 V ar(Yi) + X

i =j Cov(Yi, Yj)]

= 1 m2 [m V ar(Y1) + 0]

m E s2 m(Y1)

i=1 (Yi Y )2

i=1 (1Yi=1 Y )2

Proper Losses for Discrete Generative Models

m E 1 m 1[m Y (1 Y )2 + m(1 Y )(0 Y )2]

= E Y 1 m 1[ Y (1 Y )2 + (1 Y )( Y )2].

A.3. Multinomial Estimator

Lemma A.3.1. (Kolmogorov, 1950) Let Y M(m, p) be a multinomial random variable. A real-valued function f(p) has an unbiased estimator on the basis of an observation from Y if and only if f is a polynomial of degree at most m. The unique MVUE of such a polynomial is constructed using the following estimators (Hoeffding, 1994),

tm,jk(y) = Q

X yx(yx 1) (yx jk,x + 1) m(m 1) (m jk 1 + 1) .

Where yx is the number of observations of element x X in the sample and E tm,jk(Y ) = Y

X pjk,x x . The binomial

distribution is a special case of the multinomial distribution.

A.4. Binomial Estimator

Lemma A.4.1. (Lehmann & Casella, 2006). Let T Bin(m, α). Then f(α) is unbiasedly estimable if and only if f is a polynomial with degree m. The MVUE estimator for f(α) = αk, k m, is

δBin m,k(t) = t(t 1) (t k + 1) m(m 1) (m k + 1).

Hence E δBin m,k(T) = αk.

A.5. Poisson Estimator

Lemma A.5.1. (Glasser, 1962) A function of the Poisson parameter θ has an unbiased estimator if and only if the function can be expressed as a series in integer non-negative powers of θ. Let T Poi(θ). Then for all k N the MVUE estimator of θk is δP oi k (t) = t(t 1) (t k + 1).

Note that if t < k then δP oi k (t) = 0. Hence E δP oi k (T) = θk.

The MVUE for any estimable F(θ) can be constructed by writing the function as a power series and replacing all the θt

with its unbiased estimator given in Lemma A.5.1 (Glasser, 1962). Suppose our random variable for the count of x is Hx Poi(αpx). We can unbiasedly estimate pk x:

1 αk E δP oi k (Hx) = 1

αk (αpx)k = pk x.

B. Omitted Proofs

B.1. Omitted Proofs from Section 4

Proof of Theorem 2. Let ℓ BBn,m. We first show that ℓis (n, m)-implementable. ℓis a proper divergence by definition. ℓis a also a polynomial in both arguments with bounded degree, so let us write it in the following form:

ℓ(p, q) = X

X pik,x x Y

X qjk,x x ,

where K is finite; for all k K, ak is a nonzero constant; ik, jk NX with the pair unique for each k; ik 1 n and jk 1 m. The construction of the implementing loss is similar to that in the proof of Theorem 1. Again, we have that

Proper Losses for Discrete Generative Models

Hp Multinomial(n, p) and also Hq Multinomial(m, q). Hence we use the estimators from classical results, tik and tjk, to estimate each summand in ℓ:

Hq qm L(Hp, Hq) = E Hp pn

k K aktik(Hp)tjk(Hq) = ℓ(p, q).

Where the last equality follows by independence of Hp and Hq, and Lemma A.3.1. Thus ℓis (n, m)-implementable, and a proper divergence by definition.

For the converse, if ℓis a proper divergence but ℓ/ BBn,m, then by the characterization (Lemma A.3.1) of the U-estimable functions under a multinomial distribution, for all k, there does not exist tik or tjk such that E H pn tik(Hp) = Y

and similarly for Q X qjk,x x . Thus ℓis not (n, m)-implementable. Minimal-implementability also follows from this characterization, Lemma A.3.1.

Proof of Corollary 8. By Lemma A.3.1, the losses in BB1,m that are implementable are {g : g(p, q) = P

X fx(q)px}. Where the degree of each fx(q) is m. For a generic g we now find the report that minimizes the expected loss.

d dpg(p, q) = d

Thus any report minimizes the expected loss of any function that is (1, m)-implementable hence none of these expected losses are strictly proper divergences. In other words, all (1, m)-implementable divergences are constant for a fixed q.

B.2. Omitted Proofs from section 5

Proof of Corollary 10. By characterization in Lemma A.5.1 of functions estimable under a Poisson distribution, FP oi(α) = {ℓ( , ) : ℓis a power series in the first argument with non-negative integer powers}. FP oi(β) is similarly defined in terms of the second argument. The corollary follows by applying Theorem E.0.1.

B.3. Proof of Lemma 1

Proof. We will use the Taylor expansion for ln(x). For x [0, 1], ln(x) = P k=1 ( 1)k+1

k (x 1)k. Note that the series diverges to at x = 0 but also limx 0 ln(x) = . Shannon entropy is implemented by the fact that Hp x is a Poisson random variable distributed according to Poi(αp x) that is independent from Hp x.

Hq qm L(Hp, Hq) = E Hp pn

1 β δP oi 1 (Hq x)

k 1 αk δP oi k (Hp x)

1 β E Hq qm δP oi 1 (Hq x) X

k 1 αk E H pn δP oi k (Hp x)

k ( 1)k(px 1)k

X qx ln(px).

Proper Losses for Discrete Generative Models

B.4. Proofs from section 7

Proof of Theorem 3. We only need to show that we can unbiasedly estimate each term in g. The result then follows by linearity of expectation. To do this we will show that the vector valued random variable T = {FS(x + αi)}k i=1 is a function of a multinomial random variable. Hence the unbiased estimators and result follows from Lemma A.3.1.

Without loss of generality let α1 α2 αm and define α0 := . Now define Z Multinomial(n, (Fp(x + α1), Fp(x + α2) Fp(x + α1), . . . , Fp(x + αm) Fp(x + αm 1)). Z is a vector valued random variable where the count Zi, i {1, 2, . . . m}, corresponds to how many samples fall in the interval [x + αi, x + αi 1]. Hence we can rewrite the random variable FS(x + αi) as

FS(x + αi) = 1

n(Z1 + Z2 + Zi) = 1

Now if g is a polynomial in the first argument, then

g({Fp(x + αi)}m i=1, ) = X

i=1 Fp(x + αi)j(q) k,i.

Where aj(q) k subsumes the second argument. Now we show that the product, Qm i=1 Fp(x + αi)j(q) k,i, is unbiasedly estimable.

The result follows by linearity of expectation. Now by the condition of the theorem, j(q) k 1 n so we only have at most n distinct αi in each product hence we can ignore all the other αi that have a power of 0. Thus now we define a multinomial random variable as before except now only with the αi that are involved. Let s reindex j(q) k and α so that all the entries where j(q) k,i = 0 are above index B. Again let α1 α2 αB, then the corresponding random variable is Z Multinomial(n, (Fp(x + α1), Fp(x + α2) Fp(x + α1), . . . )). Thus by the multinomial characterization, we can estimate polynomials of the parameters of this distribution (we know n):

i=1 Fp(x + αi)j(q) k,i =

Fp(x + α1) +

γ=2 [Fp(x + αγ) Fp(x + αγ 1)] j(q) k,i

Since j(q) k 1 n for all k, this term will have degree at most n in the parameters of the multinomial distribution and so is unbiasedly estimable with the multinomial estimator. Each parameter is exactly 1

n E Zi. While this means there are different multinomial distributions for each product term in the polynomial, each term effectively sees these different distributions. Thus by linearity of expectation this is a valid way to construct the unbiased estimator.

C. Countably Infinite Domains

Lemma C.0.1. Let X be countably infinite. Let Xk X be a finite subset for all k N. Let a proper divergence be of the form

k=1 akd Xk(px, qx),

where P k=1 ak converges and d Xk is a (r, t)-implementable divergence on the empirical distribution restricted to Xk, and bounded for all Xk, p, and q. Then ℓis (r, t)-implementable.

For example, X could be a set of all english word sentences and Xk could be the set of all length k sentences.

Proper Losses for Discrete Generative Models

D. The Variance Bias Term and Connection to Proper Scoring Rules

One explanation of why the variance appears in the calculations for naive squared error has to do with Bregman Divergences and the Jensen Gap. For this section the empirical distribution is defined as ˆp = 1 |H|H

We recall the definition of a Bregman divergence:

Definition D.0.1. Let G be a convex, real valued function. Then the Bregman divergence of G is

DG(q, p) = G(q) [G(p) + G(p), q p ]

where d G(p) is a subgradient of G at p (Rockafellar, 1970).

One reason Bregman divergences are important is because they are known to characterize traditional proper losses:

Theorem D.0.1 (Mc Carthy (1956); Savage (1971); Gneiting & Raftery (2007)). A loss r is proper if and only if there exists a convex, real valued function G such that

E y q r(p, y) = E y q[DG(δy, p) G(δy)].

Where δy := ey is the indicator vector for coordinate y.

For the purposes of illustrating the role of the Jensen gap, we show implementation of the divergences DG(p, q), which have the arguments in the opposite order to the version implemented by proper losses. Later, we show how to implement the usual ordering of the arguments, DG(p, q). We begin with the RBB case for clarity of exposition.

D.1. Jensen Gap

Lemma D.1.1. If DG(p, q) is a Bregman divergence then it can be RBB-implemented, provided that an unbiased estimator, δ, exists for G(p).

L(ˆp, q) = DG(ˆp, q) [G(ˆp) δ(ˆp)]

= G(ˆp) [G(q) + G(q), ˆp q ] [G(ˆp) δ(ˆp)]

= δ(ˆp) [G(q) + G(q), ˆp q ]

Note that E[G(ˆp) δ(ˆp)] = E DG(ˆp, p), the Jensen gap. This can be interpreted as the expected additional distance the randomness of ˆp adds.

Proof. Begin with the law of cosines for Bregman divergences and take the expectation of both sides.

E H pn DG(ˆp, q) = E DG(ˆp, p) + DG(p, q) E ˆp p, G(q) G(p)

= E DG(ˆp, p) + DG(p, q) 0 (1)

= E h G(ˆp) [G(p) + G(p), ˆp p ] i + DG(p, q)

= E[G(ˆp)] [G(p) + G(p), E ˆp p ] + DG(p, q)

= E[G(ˆp)] G(E ˆp) 0 + DG(p, q). (2)

Let us note several things here. First, line (1) formalizes the intuition we have outlined in the lemma. Second we also clearly see that the expected Bregman divergence between ˆp and p, E ˆp pn DG(ˆp, p), is exactly the Jensen gap, E[G(ˆp)] G(E ˆp), as

exhibited by the resulting expression in (2). Hence, rearranging for clarity we see that

DG(p, q) = E H pn DG(ˆp, q) DG(ˆp, p)

= E DG(ˆp, q) E G(ˆp) G(E ˆp) .

Proper Losses for Discrete Generative Models

Which gives us the RBB-implementing loss for DG if we have an unbiased estimator δ where E δ(ˆp) = G(E p) = G(p).

Example D.1.1. As an example, let s look at the Jensen Gap for G(x) = x 2.

E H pn G(ˆp) G( E H pn ˆp) = E[ ˆp 2] E ˆp 2

X E[ˆp2 x] E[ˆpx]2

X V ar(ˆpx).

Lemma D.1.2. If DG(p, q) is a Bregman divergence then an equivalent divergence can be BB-implemented, provided that an unbiased estimators δ, and δ where E δ(ˆp) = G(p), and E δ (ˆq) = G(q) exist.

L(ˆp, ˆq) = δ(ˆp) δ(ˆq) ˆp, δ (ˆq) .

Hq qm L(ˆp, ˆq) = E h δ(ˆp) δ(ˆq) ˆp, δ (ˆq) i

= G(p) G(q) p, G(q)

= G(p) [G(q) + G(q), p q ] q, G(q)

= DG(p, q) q, G(q) .

Note that the divergence implemented by the above loss, DG(p, q) q, G(q) , when considered from the candidate distributions point of view, is merely the original divergence, DG(p, q), minus a constant. Thus to the candidate model, the loss is equivalent up to a constant to the original Bregman divergence. The same idea is used for classical proper losses as exhibited by Theorem D.0.1.

D.2. Implementing DG(q, p)

We will implement the same equivalent divergence as in Theorem D.0.1:

DG(q, p) G(q) = G(p) G(p), p + G(p), q .

Notice that this equivalent divergence can be implemented if each additive term can be unbiasedly estimated on its own. Thus we will need unbiased estimators for q, G(p), G(p), and p, G(p) . In the case of deterministic or Poisson sampling, if G(p) can be unbiasedly estimated then G(p) and p, G(p) can be unbiasedly estimated. Estimating q is easy.

We will use the characterization of deterministic sampling and Poisson unbiasedly estimable functions. The following is applicable to both Poisson and deterministic sampling.

Suppose G(p) is unbiasedly estimable. Then G(p) = P

x X pjk,x x where |K| is possibly infinite. Then

k K:jk,y =0 ak jk,ypjk,y 1 y Y

x =y pjk,x x

Proper Losses for Discrete Generative Models

is an (infinite) polynomial, thus it is unbiasedly estimable. Note that if {k K : jk,y = 0} = then G(p)y = 0, for all p. Furthermore, p, G(p) = X

is an also an (infinite) polynomial, so unbiasedly estimable.

One can view the following results as corollaries to the characterization of Poisson implementable divergences (Corollary 10). Lemma D.2.1. Let r be a Poisson sampling scheme and t be any sampling scheme that can estimate q. If DG(q, p) is a Bregman divergence then an equivalent divergence can be (r, t)-implemented if and only if G has an equivalent power series expression in the first and second arguments with non-negative integer powers and the power series satisfies 1) every coefficient of the first and second arguments is finite and 2) if the series diverges for any argument, the proper divergence also diverges in the same direction (goes to + or ). Corollary D.2.1. Let r be a deterministic size sampling scheme drawing n samples and t be any sampling scheme that can estimate q. If DG(q, p) is a Bregman divergence then an equivalent divergence can be (n, t)-implemented if and only if G is a polynomial with degree less than or equal to n.

E. General Sampling Schemes

Definition E.0.1. A generic-black-box (GBB) loss is a function L : NX NX R where L(hp, hq) is the loss assigned to histogram hp of samples drawn from the model on histogram hq of samples drawn from the target distribution.

The difference between a GBB loss and a BB loss (Definition 4.1) is that we allowed BB losses to be a function of histograms of a specific, predetermined size (n and m). In contrast, a GBB loss must be defined for histograms of any size. These functions can also compute N, the sample size. Definition E.0.2. A sampling scheme r is a stopping rule for the process of drawing observations from a black-box generative model. The stopping rule may depend on the history of the seen observations and may also use randomness. Definition E.0.3. Let r, t be sampling schemes for the report and the target distribution, respectively. A generic-black-box loss L is (r, t)-black-box proper if L(p, q) := E r,t L(Hp, Hq) is a proper divergence. If ℓis some proper divergence and

there exists L such that L = ℓ, we will say that L (r, t)-implements ℓand that ℓis (r, t)-implementable.

Given a characterization of the U-estimable functions under certain sampling schemes, we can construct the set of implementable proper divergences. We can also construct the respective implementing losses from these characterizations. We do not investigate the sample complexity of the schemes or define minimally-implementable in the generic setting. While one could consider ordering generic sampling schemes by e.g. expected number of ramples drawn, the most reasonable ordering of sampling schemes is not always clear, and we leave such investigations to future work. Theorem E.0.1. Let r, t be sampling schemes. Let T be the set of all proper distances and let

Fr = {ℓ( , ) : ℓis unbiasedly estimable in the first argument under sampling scheme r}

Ft = {ℓ( , ) : ℓis unbiasedly estimable in the second argument under sampling scheme t}.

Then the set of all (r, t)-implementable proper divergences is

BBr,t = T Fr Ft

Proof. If we can characterize Fr and Ft then we have a characterization of the unbiasedly estimable functions under sampling schemes r and t, respectively. These characterizations must provide constructions of the unbiased estimators. Thus we can construct an unbiased estimator for each ℓ Fr Ft. Hence ℓ BBr,t is implementable and a proper divergence, by definition.

F. Omitted results from section 7

For this section we consider densities on continuous domains. For a density p over R, Fp( ) is the CDF of p.

Definition F.0.1 (Empirical CDF). Given a sample s = {Xi}n i=1 the empirical CDF is defined as Fs(x) := |i:Xi x|

Proper Losses for Discrete Generative Models

F.1. Implementation of the Cram er Distance

Corollary F.1.1. For densities p, q over [0, 1], let s and u be samples drawn from p and q, respectively. Then the loss

L(s, u) = Z

(Fs(x) Fu(x))2 s2 |s|(Fs(x)) s2 |u|(Fu(x)) dx

(2, 2)-minimally-implements the Cram er distance, ℓ(p, q) = R

R(Fp(x) Fq(x))2dx (Cram er, 1928).

Proof of Corollary F.1.1. Notice that FS(x) = |i:Xi x|

n is distributed according to 1

n Bin(n, Fp(x)).

E S,U L(S, U) = E S,U

(FS(x) FU(x))2 s2 |S|(FS(x)) s2 |U|(FU(x)) dx

E[FS(x)2] 2 E FU(x) E FS(x) + E[FU(x)]2 V ar(FS(x)) V ar(FU(x))dx

E[FS(x)]2 2Fq(x)Fp(x) + E[FU(x)]2dx

(Fp(x) Fq(x))2dx.

Where the second equality is by the independence of S and U and the previously defined variance estimator for the binomial distribution (claim A.2.1). The third equality is by expanding expectation of the squared term and reducing (the second non-centered moment of a binomial). One could also prove this using the technique from the proof of claim 3.

The energy distance in one dimension is equivalent to twice the Cram er distance. Thus the energy distance also gives a loss that implements the Cram er distance. See appendix G for a discussion of the relationships between different types of losses in the continuous setting

F.2. High Dimensional Extension of the Cram er Distance

Let us now work in a continuous domain where the samples are from Rj. Now instead of distributions p, q X we will have densities p, q on Rj. The desired score will again be the Harald Cram er distance. However now we will define a CDF with respect to a direction and then integrate over all directions.

Definition F.2.1. (Generalized CDF). Let Y be a random variable taking values in Rj, p be the associated density, and v Rj such that v = 1. Then the direction v CDF of Y is

F v p (x) = Pr[ v, Y + x 0].

The distance analogous to the Harald Cram er distance is then

v Rj ||v||=1

(F v p (x) F v q (x))2dx dv.

Now to create a sample proper loss we may again introduce a variance correction term as before. However, we also note that if we pick a random direction v, then we would not have to integrate over all v since the expectation of the distance under a random v is the same as the deterministic distance. Below we show the RBB loss, however the result for the BB version is very similar; one can compare between corollary F.1.1 and the following.

Proper Losses for Discrete Generative Models

Claim F.2.1. Let p, q be densities over Rj and s be the sample drawn from p. Then the following loss is RBB proper. First pick a random unit vector v Rj then

L(s, q) = Z

R (F v s (x) F v q (x))2 s2 n(F v s (x)) dx.

In other words, L implements ℓ(p, q) = R

v Rj ||v||=1

R(F v p (x) F v q (x))2dx dv.

Proof. Let S = (X1, X2, . . . , Xn).

E v Sj 1 E S L(ˆp, q) = E v Sj 1 E S

(F v S(x) F v q (x))2 s2 n(F v p (x)) dx

v Rj ||v||=1

(F v S(x) F v q (x))2 s2 n(F v p (x)) dx dv

v Rj ||v||=1

E[(F v S(x) F v q (x))2 s2 n(F v p (x))] dx dv

v Rj ||v||=1

Fp(x)2 + F v p (x)(1 F v p (x)) n 2F v q (x)F v p (x) + F v q (x)2

F v p (x)(1 F v p (x)) n dx dv

v Rj ||v||=1

Fp(x)2 2F v q (x)F v p (x) + F v q (x)2 dx dv

v Rj ||v||=1

(F v p (x) F v q (x))2dx dv.

Let C Bin(n, F v p (x)). Once again note that F v S(x) = 1

n|{Xi S : v, Xi + x 0}| = 1

n C. Hence we expand the expectation with the first and second moment as in Claim F.1.1.

G. Discussion of other continuous losses

We discuss our results in the previous section in relation to two other methods of generative model evaluation in the continuous setting. Our results rely on computing losses based on the empirical CDF whether in one or many dimensions.

First, unless estimation/smoothing is done on the empirical density, it is not possible to work with losses that integrate over a function of the two densities at every point in the outcome space. There is a large body of work on density estimation for evaluating generative models. However, losses based on kernel density estimation are beyond the scope of this work.

Second, one can trivially construct proper losses based on functions of the random variables associated with densities p and q. For example the energy distance is

D2(F, G) = 2 E X Y E X X E Y Y .

Where X, X and Y, Y are independent copies of the random variable associated with density p and q, respectively.

The number of independent copies of a random variable in the expression is exactly the number of independent samples from that random variable required to unbiasedly estimate the loss. For the energy distance, we need 2 independent samples from p and q each. In one dimension these can also be written as functions of the empirical CDF.

Proper Losses for Discrete Generative Models

G.1. Connection to energy distance in one dimension

In the one dimensional continuous setting, we have densities p, q on R. We repeat the proof that the energy distance is equal to twice the Cram er distance. We can see from our approach or the form of the energy distance that this loss is (2, 2)-minimally implementable.

Lemma G.1.1. (Sz ekely & Rizzo, 2005) Let X, X be i.i.d. with CDF F(x) and Y, Y be i.i.d. with CDF G(y). Then the energy distance in one dimension is equal to twice the Cram er distance.

2 E |X Y | E |X X | E |Y Y | = 2 Z

(F(x) G(x))2dx.

Proof. We will convert the energy distance into the Cramer distance. First we use the identity

1(X u < Y ) + 1(Y u < X)du

Now let A = E |X Y |, B = E |X X |, C = |Y Y |. We then use Fubini s theorem.

A = E |X Y |

1(X u < Y ) + 1(Y u < X)dudxdy

1(X u < Y ) + 1(Y u < X)dxdydu

Pr[X u] Pr[Y > u] + Pr[X > u] Pr[Y u]du

F(u)(1 G(u)) + (1 F(u))G(u)du

F(u) 2F(u)G(u) + G(u)du

Hence by similar derivation, B = R

R 2F(u) 2F(u)2du and C = R

R 2G(u) 2G(u)2du. The lemma follows by simple algebra.

G.2. Connection to the CRPS

We derived the Cram er distance via extending the Continuously Ranked Probability Score from the proper scoring rules literature. Intuitively, one can think CRPS as evaluating a distribution against an empirical distribution consisting of a single sample (Gneiting & Raftery, 2007). Let Fr be the CDF of a density r and again p be the reported distribution and q the true distribution. Then the CRPS (in terms of a loss to be minimized) for a outcome particular y drawn from the density q is

(Fp(x) 1{x y})2dx = Z

(Fp(x) Fˆq(x))2dx. (3)

Where ˆq(x) = Hq is the empirical distribution of the data consisting of the single sample y. Note that Fˆq(x) is 0 below y and 1 when greater than or equal to y. Hence Fˆq(x) = 1{x y}. It is easy to see from the form of the CRPS that CRPS is also (2, 1)-minimally-implementable since the LHS of (3) contains a polynomial of degree 2 in Fp and it requires only

Proper Losses for Discrete Generative Models

1 sample from q. There also exists a form of the CRPS derived from the form of the equivalent energy distance that also shows (2, 1)-minimal-implementability (Gneiting & Raftery, 2007).

To extend CRPS to our setting, in which we have an empirical densities for ˆp and ˆq, we derived

ℓ(p, q) = Z 1

0 (Fp(x) Fq(x))2dx.

Which is the Harald Cram er distance (Cram er, 1928). The CRPS is a special case when we draw only 1 sample from q. We give a BB loss that implements this distance in claim F.1.1.

H. Omitted Experiments

H.1. Distribution Definitions

Definition H.1.1 (Spiked Uniform).

( 0.1 if x {1, 2, 3, 4, 5} 1 0.5

K 5 otherwise

Definition H.1.2 (Spiked Zipfian(r)). Let zx be the probability mass of x in a Zipfian(r) distribution over an outcome space {1, . . . , K}

( 0.05 1.15 if x {5, 10, 20} zx 1.15 otherwise

Proper Losses for Discrete Generative Models

H.2. Absolute Deviation

(a) Squared Distance, p: Zipfian(1), q: Zipfian(2) (b) Cross Entropy, p: Zipfian(1), q: Zipfian(2)

(c) Squared Distance, p: Uniform, q: Zipfian(0.01) (d) Cross Entropy, p: Uniform, q: Zipfian(0.01)

(e) Squared Distance, p: Uniform, q: Spiked Uniform (definition H.1.1)

(f) Cross Entropy, p: Uniform, q: Spiked Uniform (definition H.1.1)

Proper Losses for Discrete Generative Models

(g) Squared Distance, p: Zipfian(1), q: Spiked Zipfian(1) (definition H.1.2)

(h) Cross Entropy, p: Zipfian(1), q: Spiked Zipfian(1) (definition H.1.2)

Figure 2. The y-axis is the absolute error deviation the black-box loss value and the true distances (Squared distance and Cross Entropy). K = 10, 000 for all trials. 30 trials for each parameter setting was recorded (including batch size). The horizontal bars represent the maximum absolute deviation of any of the 30 trials. The solid markers represent the average of the trials. Squared distance was estimated using the loss from claim 3. Cross entropy was estimated with a poisson sampling loss. Batch sizes were the same between draws from p and q.

Proper Losses for Discrete Generative Models

H.3. Two-sided deviation

(a) Squared Distance, p: Zipfian(1), q: Zipfian(2) (b) Cross Entropy, p: Zipfian(1), q: Zipfian(2)

(c) Squared Distance, p: Uniform, q: Zipfian(0.01) (d) Cross Entropy, p: Uniform, q: Zipfian(0.01)

(e) Squared Distance, p: Uniform, q: Spiked Uniform (definition H.1.1)

(f) Cross Entropy, p: Uniform, q: Spiked Uniform (definition H.1.1)

Proper Losses for Discrete Generative Models

(g) Squared Distance, p: Zipfian(1), q: Spiked Zipfian(1) (definition H.1.2)

(h) Cross Entropy, p: Zipfian(1), q: Spiked Zipfian(1) (definition H.1.2)

Figure 3. The y-axis is the deviation between the black-box loss value and the true distances (Squared distance and Cross Entropy). K = 10, 000 for all trials. 30 trials for each parameter setting was recorded (including batch size). The horizontal bars represent the maximum deviation on either side of any of the 30 trials. The solid markers represent the average of the trials. Squared distance was estimated using the loss from claim 3. Cross entropy was estimated with a poisson sampling loss. Batch sizes were the same between draws from p and q.

Proper Losses for Discrete Generative Models

H.4. Normalized Absolute Deviation

(a) Squared Distance, p: Zipfian(1), q: Zipfian(2) (b) Cross Entropy, p: Zipfian(1), q: Zipfian(2)

(c) Squared Distance, p: Uniform, q: Zipfian(0.01) (d) Cross Entropy, p: Uniform, q: Zipfian(0.01)

Proper Losses for Discrete Generative Models

(e) Squared Distance, p: Uniform, q: Spiked Uniform (definition H.1.1)

(f) Cross Entropy, p: Uniform, q: Spiked Uniform (definition H.1.1)

(g) Squared Distance, p: Zipfian(1), q: Spiked Zipfian(1) (definition H.1.2)

(h) Cross Entropy, p: Zipfian(1), q: Spiked Zipfian(1) (definition H.1.2)

Figure 4. The y-axis is the normalized absolute deviation between the black-box loss value and the true distances (Squared distance and Cross Entropy). K = 10, 000 for all trials. The deviations are normalized by the true distance between p and q. 30 trials for each parameter setting was recorded (including batch size). The horizontal bars represent the maximum normalized absolute deviations of any of the 30 trials. The solid markers represent the average of the trials. Note that when the squared distance is close to 0, the normalized error becomes very difficult to keep low. The un-normalized error in the previous section is more appropriate in this case. Squared distance was estimated using the loss from claim 3. Cross entropy was estimated with a poisson sampling loss. Batch sizes were the same between draws from p and q.