Entropy Estimation via Normalizing Flow

Ziqiao Ao, Jinglai Li
School of Mathematics, University of Birmingham
{zxa029, j.li.10}@bham.ac.uk

Abstract

Entropy estimation is an important problem in information theory and statistical science. Many popular entropy estimators suffer from fast-growing estimation bias with respect to dimensionality, rendering them unsuitable for high-dimensional problems. In this work we propose a transform-based method for high-dimensional entropy estimation, which consists of the following two main ingredients. First, by modifying the k-NN based entropy estimator, we propose a new estimator which enjoys small estimation bias for samples that are close to a uniform distribution. Second, we design a normalizing flow based mapping that pushes samples toward a uniform distribution, and we derive the relation between the entropy of the original samples and that of the transformed ones. As a result, the entropy of a given set of samples is estimated by first transforming them toward a uniform distribution and then applying the proposed estimator to the transformed samples. Numerical experiments demonstrate the effectiveness of the method for high-dimensional entropy estimation problems.

Introduction

Entropy is one of the most fundamental concepts in information theory, and it has also found vast applications in other disciplines such as physics, statistics and machine learning. For example, in data science contexts, various applications rely critically on the estimation of entropy, including goodness-of-fit testing (Vasicek 1976; Goria et al. 2005), sensitivity analysis (Azzi, Sudret, and Wiart 2020), parameter estimation (Ranneby 1984; Wolsztynski, Thierry, and Pronzato 2005), and Bayesian experimental design (Sebastiani and Wynn 2000; Ao and Li 2020). As a measure of the average surprisal over a variable's possible outcomes, entropy provides a natural way to quantify the uncertainty of the probability distribution of interest. In this work we focus on the continuous version of entropy, which takes the form

$$ H(X) = -\int \log[p_x(x)]\, p_x(x)\, dx, \qquad (1) $$

where $p_x(x)$ is the probability density function of a random variable $X$. Despite the rather simple definition, entropy only admits an analytical expression for a limited family of distributions and needs to be evaluated numerically in general. When the distribution of interest is analytically available, in principle its entropy can be estimated by numerical integration schemes such as the Monte Carlo method. However, in many real-world applications the distribution of interest is not analytically available, and one has to estimate the entropy from i.i.d. realizations drawn from the target distribution, which makes exact computation of the entropy difficult or even impossible.

Entropy estimation has attracted considerable attention from various communities in the last few decades, and a large number of methods have been developed to evaluate entropy directly from realizations. In this work we only consider non-parametric approaches, which do not assume any parametric model of the target distribution; these methods can be broadly classified into two categories. The first line of methods, known as plug-in estimators, first estimate the unknown probability density and then compute the integral in Eq. (1) using numerical integration or Monte Carlo (see Beirlant et al. (1997) for a detailed description).
Some examples of density estimation approaches that have been studied for plug-in methods are the kernel density estimator (Joe 1989; Hall and Morton 1993), the histogram estimator (Györfi and Van der Meulen 1987; Hall and Morton 1993) and the field-theoretic approach (Chen, Tareen, and Kinney 2018). A major limitation of this type of method is that it relies on an effective density estimate, which is a difficult problem in its own right, especially when the dimensionality of the problem is high. A different strategy is to estimate the entropy directly from the independent samples of the random variable. Methods following this line include sample-spacing (Miller 2003) and k-nearest neighbor (k-NN) based estimators (Kozachenko and Leonenko 1987; Kraskov, Stögbauer, and Grassberger 2004). The latter is particularly appealing among the existing estimation methods for its theoretical and computational advantages and has been widely used in practical problems. More recent variants and extensions of the k-NN methods include (Gao, Ver Steeg, and Galstyan 2015; Berrett et al. 2019).

Entropy estimation becomes increasingly difficult as the dimensionality grows, and this difficulty is mainly due to the estimation bias, which decays very slowly with respect to the sample size for high-dimensional problems. For example, in many popular approaches including the k-NN method (Kozachenko and Leonenko 1987), the estimation bias decays at the rate of $O(N^{-\gamma/d})$, where $N$ is the sample size, $d$ is the dimensionality, and $\gamma$ is a positive constant (Krishnamurthy et al. 2014; Kandasamy et al. 2015; Gao, Oh, and Viswanath 2018; Sricharan, Wei, and Hero 2013). As a result, few if any of the existing entropy estimation methods can effectively handle high-dimensional problems without strong assumptions on the smoothness of the underlying distribution (Kandasamy et al. 2015).

The main goal of this work is to provide an effective entropy estimation approach which achieves a faster bias decay rate under mild smoothness assumptions, and which therefore applies to high-dimensional problems (i.e., ones of 20 dimensions or higher; see Footnote 1). The method presented here consists of two main ingredients. First, we propose two truncated k-NN estimators based on those of (Kozachenko and Leonenko 1987) and (Kraskov, Stögbauer, and Grassberger 2004) respectively, and we also provide bounds on the estimation bias of these estimators. Remarkably, our theoretical results suggest that the estimators achieve zero bias for uniform distributions, whereas no such result is available for any existing k-NN based estimator, according to the bias analyses available to date (Gao, Oh, and Viswanath 2018; Singh and Póczos 2016; Biau and Devroye 2015). This property offers the possibility of significantly improving the performance of entropy estimation by mapping the data points toward a uniform distribution. Therefore the second main ingredient of the method is the normalizing flow (NF) technique (Rezende and Mohamed 2015; Papamakarios et al. 2021), which constructs a sequence of invertible and differentiable mappings that transform a simple base distribution, such as a standard Gaussian, into a more complicated distribution whose density function may not be available.
Specifically, we use the Masked Autoregressive Flow (Papamakarios, Pavlakou, and Murray 2017), an NF algorithm originally developed for density estimation, combined with the probability integral transform, to push the original data points toward the uniform distribution. We then estimate the entropy of the resulting near-uniform data points with the proposed truncated k-NN estimators, and derive that of the original ones accordingly (by adding an entropic correction term due to the transformation). Therefore, by combining the truncated k-NN estimators and the normalizing flow model, we are able to decode a complex high-dimensional distribution represented by realizations and obtain an accurate estimate of its entropy. Finally, we use several complex high-dimensional distributions to demonstrate the performance of the proposed scheme and apply it to Bayesian experimental design problems.

Footnote 1: We note that in many statistical applications, problems of 20 dimensions are not regarded as high-dimensional. However, the well-known minimax bias results (e.g., (Han et al. 2020; Birgé and Massart 1995)) indicate that without the strong smoothness assumption (Kandasamy et al. 2015), the curse of dimensionality is inevitable, and as a result problems of 20 dimensions or higher are deemed high-dimensional in entropy estimation.

k-NN Based Entropy Estimation

We provide a brief introduction to two traditional k-NN based entropy estimators in this section. We start with the original k-NN entropy estimator proposed by (Kozachenko and Leonenko 1987), where the k-th nearest neighbor is contained in the smallest possible closed ball. Next, we introduce a popular variant of the k-NN estimator proposed in (Kraskov, Stögbauer, and Grassberger 2004), which uses the smallest possible hyper-rectangle to cover at least k points. We finally discuss some theoretical analysis of the estimation errors of these estimators.

Kozachenko-Leonenko Estimator

Recall the definition of entropy in Eq. (1). Given a density estimator $\hat p_x(x)$ for $p_x(x)$ and a set of $N$ i.i.d. samples $S = \{x^{(i)}\}_{i=1}^N$ drawn from $p_x(x)$, the entropy of the random variable $X$ can be estimated as follows:

$$ \hat H(X) = -\frac{1}{N}\sum_{i=1}^{N} \log \hat p_x(x^{(i)}). \qquad (2) $$

The Kozachenko-Leonenko (KL) estimator relies on a local uniformity assumption to obtain the estimate $\hat p_x(x)$. For each $x^{(i)}$, one first identifies its $k$ nearest neighbors (in terms of the p-norm distance) and defines the smallest closed ball covering all these $k$ neighbors as

$$ B(x^{(i)}, \epsilon_i/2) = \{x \in \mathbb{R}^d : \|x - x^{(i)}\|_p \le \epsilon_i/2\}, $$

where $\epsilon_i$ is twice the distance between $x^{(i)}$ and its k-th nearest neighbor in the set $S$. We shall refer to the closed ball $B(x^{(i)}, \epsilon_i/2)$ as a cell centered at $x^{(i)}$, and let $q_i$ be the mass of the cell $B(x^{(i)}, \epsilon_i/2)$, i.e., $q_i = \int_{x \in B(x^{(i)}, \epsilon_i/2)} p_x(x)\, dx$. It can be derived that the expectation of $\log q_i$ over $\epsilon_i$ is given by

$$ E(\log q_i) = \psi(k) - \psi(N), \qquad (3) $$

where $\psi(x) = \Gamma'(x)/\Gamma(x)$ with $\Gamma(x)$ being the Gamma function (Kraskov, Stögbauer, and Grassberger 2004). The KL estimator then assumes that the density is constant in $B(x^{(i)}, \epsilon_i/2)$, which gives

$$ q_i(\epsilon_i) \approx c_d\, \epsilon_i^d\, p_x(x^{(i)}), \qquad (4) $$

where $d$ is the dimension of $X$ and $c_d = \Gamma(1 + 1/p)^d / \Gamma(1 + d/p)$ is the volume of the d-dimensional ball of unit diameter with respect to the p-norm. Combining (3) and (4), one obtains an estimate of the log-density at each sample point,

$$ \log \hat p_x(x^{(i)}) = \psi(k) - \psi(N) - \log c_d - d \log \epsilon_i. \qquad (5) $$

Plugging the above estimates for $i = 1, \ldots, N$ into (2) yields the KL estimator:

$$ \hat H_{KL}(X) = -\psi(k) + \psi(N) + \log c_d + \frac{d}{N}\sum_{i=1}^{N} \log \epsilon_i. \qquad (6) $$
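To make Eq. (6) concrete, the following is a minimal Python sketch of the KL estimator for the Euclidean norm ($p = 2$), not the implementation used for the experiments in this paper; the function name and the default $k$ are ours, and duplicate samples (which give $\epsilon_i = 0$) are not handled.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko estimate of Eq. (6) with the p = 2 norm.

    x : (N, d) array of i.i.d. samples; k : number of nearest neighbours.
    """
    n, d = x.shape
    # distances to the k nearest neighbours; column 0 is the query point itself
    dist, _ = cKDTree(x).query(x, k=k + 1)
    eps = 2.0 * dist[:, k]                        # eps_i: twice the k-NN distance
    # log c_d with c_d = Gamma(1 + 1/p)^d / Gamma(1 + d/p), here for p = 2
    log_cd = d * gammaln(1.5) - gammaln(1.0 + d / 2.0)
    return -digamma(k) + digamma(n) + log_cd + d * np.mean(np.log(eps))
```

For $p = \infty$ the same formula applies with $\log c_d = 0$ and the neighbor search carried out in the infinity norm.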
KSG Estimator

As mentioned earlier, the Kraskov-Stögbauer-Grassberger (KSG) estimator is an important variant of $\hat H_{KL}$. Unlike the KL estimator, which is based on closed balls, the KSG estimator uses hyper-rectangles to form the cells at each data point. Namely, one chooses the $\infty$-norm as the distance metric (i.e., $p = \infty$), so that the cell $B(x^{(i)}, \epsilon_i/2)$ becomes a hyper-cube with side length $\epsilon_i$. Next, we allow the hyper-cube to become a hyper-rectangle, i.e., the cells admit different side lengths along different dimensions. Specifically, for $j = 1, \ldots, d$, we define $\epsilon_{i,j}$ to be twice the distance between $x^{(i)}$ and its k-th nearest neighbor along dimension $j$, and the cell centered at $x^{(i)}$ covering its $k$ nearest neighbors becomes

$$ B(x^{(i)}, \epsilon_{i,1:d}/2) = \{x = (x_1, \ldots, x_d) : |x_j - x^{(i)}_j| \le \epsilon_{i,j}/2,\ \text{for } j = 1, \ldots, d\}, \qquad (7) $$

where $\epsilon_{i,1:d} = (\epsilon_{i,1}, \ldots, \epsilon_{i,d})$. This change leads to a different formula for the mass of the cell $B(x^{(i)}, \epsilon_{i,1:d}/2)$,

$$ E(\log q_i) \approx \psi(k) - \frac{d-1}{k} - \psi(N). \qquad (8) $$

It is worth noting that the equality in Eq. (3) is replaced by an approximate equality in Eq. (8), because a uniform density within the rectangle has to be assumed to obtain Eq. (8) (see Lemma 2 in the supplementary information (SI) for details). Using a similar local assumption as in Eq. (4), the KSG estimator is derived as

$$ \hat H_{KSG}(X) = -\psi(k) + \psi(N) + \frac{d-1}{k} + \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d} \log \epsilon_{i,j}. \qquad (9) $$

We note that the KSG method was actually developed in the context of estimating mutual information (Kraskov, Stögbauer, and Grassberger 2004), and it has been reported to outperform the KL estimator in a wide range of problems (Gao, Oh, and Viswanath 2018). As shown above, it is straightforward to extend it to entropy estimation, and our numerical experiments suggest that it also has competitive performance as an entropy estimator, which will be demonstrated later. (A code sketch of this estimator is given below.)

Figure 1: The schematic illustration of the truncated estimator. The shaded area is that removed from the k-NN cell.

Convergence Analysis

Another important issue is to analyze the estimation errors of these entropy estimators, and especially how they behave as the sample size increases. In most k-NN based estimators, including the two mentioned above, the variance is generally well controlled, decaying at a rate of $O(N^{-1})$ with $N$ being the sample size, so the main issue lies in the estimation bias. The bias of the estimator $\hat H_{KL}$ has been well studied, whereas that of $\hat H_{KSG}$ has received very little attention. Previous results on the former are as follows. The original paper (Kozachenko and Leonenko 1987) established asymptotic unbiasedness for $k = 1$, while (Singh et al. 2003) obtained the same result for general $k$. For distributions with unbounded support, (Tsybakov and Van der Meulen 1996) proved that the bias bound decays at a rate of $O(1/\sqrt{N})$ for $d = 1$, and (Gao, Oh, and Viswanath 2018) generalized this to higher dimensions, obtaining a bias bound of $O(N^{-1/d})$ up to polylogarithmic factors. For compactly supported distributions, densities satisfying the $\beta$-Hölder condition are usually considered. (Biau and Devroye 2015) gave a quick-and-dirty upper bound on the bias, $O(N^{-\beta})$, for a simple class of univariate densities supported on $[0, 1]$ and bounded away from zero. (Singh and Póczos 2016) proved that the bias is around $O(N^{-\beta/d})$ ($\beta \in (0, 2]$) for general $d$ under some additional conditions on the boundary of the support. We reiterate that all these works obtained a variance bound of $O(N^{-1})$.
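Returning to the KSG estimator of Eq. (9), the sketch below is one possible implementation, again not the one used in this paper: it reads the side lengths $\epsilon_{i,j}$ as those of the smallest axis-aligned hyper-rectangle containing the $k$ nearest neighbors found in the $\infty$-norm, which is one way to interpret the definition above, and the function name and defaults are ours.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_entropy(x, k=3):
    """KSG entropy estimate of Eq. (9) with hyper-rectangular cells."""
    n, d = x.shape
    # indices of the k nearest neighbours in the infinity norm (column 0 is the point itself)
    _, idx = cKDTree(x).query(x, k=k + 1, p=np.inf)
    neighbours = x[idx[:, 1:]]                                    # shape (N, k, d)
    # eps_{i,j}: twice the largest coordinate-wise distance to the k neighbours
    eps = 2.0 * np.max(np.abs(neighbours - x[:, None, :]), axis=1)
    return (-digamma(k) + digamma(n) + (d - 1.0) / k
            + np.mean(np.sum(np.log(eps), axis=1)))
```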
It should be noted that the bias bounds given by previous studies typically depend on certain properties of the target density, such as its smoothness parameter and Hessian, which suggests that these estimators perform well on distributions satisfying certain conditions. This motivates the idea that one can transform the given data points toward a desired distribution to obtain a more accurate entropy estimate, which is detailed in the next section.

Uniformizing Mapping Based Entropy Estimation

In this section, we present an entropy estimation approach that is based on normalizing flow. As mentioned earlier, it consists of two main ingredients: a truncated version of the k-NN entropy estimators, and a transformation that maps data points toward a uniform distribution.

Truncated KL/KSG Estimators

For compactly supported distributions, a significant source of bias comes from the boundary of the support, where the k-NN cells include areas outside of the support of the density (Singh and Póczos 2016). Intuitively speaking, incorrectly including such areas results in an underestimate of the densities, leading to bias in the estimator. We thus propose a method to reduce the estimation bias by excluding the areas outside of the distribution support, and remarkably the resulting estimators enjoy certain convergence properties which enable us to design the NF based estimation approach. The only additional requirement for using these estimators is that the bound of the support of the density must be specified. Without loss of generality, we suppose the target density is supported on the unit cube $Q := [0, 1]^d$ in $\mathbb{R}^d$. The procedure is as follows: we first determine all the cells using either KL or KSG, then examine whether each k-NN cell covers area outside of the distribution support, and if so, truncate the cell at the boundary to exclude that area. Mathematically, the truncated KL (t-KL) estimator (with the $\infty$-norm) is given by

$$ \hat H_{t\text{-}KL}(X) = -\psi(k) + \psi(N) + \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d} \log \xi_{i,j}, \qquad (10) $$

$$ \xi_{i,j} = \min\{x^{(i)}_j + \epsilon_i/2,\, 1\} - \max\{x^{(i)}_j - \epsilon_i/2,\, 0\}; $$

and the truncated KSG (t-KSG) estimator is given by

$$ \hat H_{t\text{-}KSG}(X) = -\psi(k) + \psi(N) + \frac{d-1}{k} + \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d} \log \zeta_{i,j}, \qquad (11) $$

$$ \zeta_{i,j} = \min\{x^{(i)}_j + \epsilon_{i,j}/2,\, 1\} - \max\{x^{(i)}_j - \epsilon_{i,j}/2,\, 0\}. $$

To numerically demonstrate the improvement of the truncated estimators over the conventional versions, we compare their performance on multidimensional Beta distributions with various shape parameters, and the results are reported in the SI. (A code sketch of the t-KL estimator is given after Corollary 1 below.)

Next we theoretically analyze the bias of the truncated estimators. Our analysis relies on some assumptions on the density function $p_x$, summarized below:

Assumption 1. The distribution $p_x$ satisfies: (a) $p_x$ is continuous and supported on $Q$; (b) $p_x$ is bounded away from 0, i.e., $C_1 = \inf_{x \in Q} p_x(x) > 0$; (c) the gradient of $p_x$ is uniformly bounded on $Q^o$, i.e., $C_2 = \sup_{x \in Q^o} \|\nabla p_x(x)\|_1 < \infty$.

First we consider the bias of the estimator $\hat H_{t\text{-}KL}$; the following theorem states that the bias of $\hat H_{t\text{-}KL}$ is bounded and vanishes at the rate of $O(N^{-1/d})$.

Theorem 1. Under Assumption 1 and for any finite $k$ and $d$, the bias of the truncated KL estimator is bounded by

$$ \bigl| E[\hat H_{t\text{-}KL}(X)] - H(X) \bigr| \le \frac{C_2}{C_1^{1+1/d}} \left(\frac{k}{N}\right)^{1/d}, $$

and the variance of the truncated KL estimator is bounded by $\mathrm{Var}[\hat H_{t\text{-}KL}(X)] \le C N^{-1}$ for some $C > 0$.

Proof. See the SI.

Note that $C_2 = 0$ when $p_x$ is uniform on $Q$, and the following corollary follows directly:

Corollary 1. Under the assumptions of Theorem 1, if $X$ is uniformly distributed on $Q$, then the truncated KL estimator is unbiased.
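As referenced above, here is a minimal sketch of the t-KL estimator of Eq. (10) for samples supported on $[0, 1]^d$; it is an illustrative implementation under our own naming, not the code used for the experiments in this paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def truncated_kl_entropy(x, k=3):
    """t-KL estimate of Eq. (10) for samples on the unit cube, infinity norm."""
    n, d = x.shape
    # eps_i / 2 = distance (infinity norm) to the k-th nearest neighbour
    dist, _ = cKDTree(x).query(x, k=k + 1, p=np.inf)
    half = dist[:, k][:, None]
    # xi_{i,j}: side lengths of the k-NN cube clipped to the boundary of [0, 1]^d
    xi = np.minimum(x + half, 1.0) - np.maximum(x - half, 0.0)
    return -digamma(k) + digamma(n) + np.mean(np.sum(np.log(xi), axis=1))
```

Without the clipping (i.e., with every $\xi_{i,j}$ equal to $\epsilon_i$), this reduces to the $\infty$-norm version of Eq. (6).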
Corollary 1 is the theoretical foundation of the proposed method, as it suggests that if one can transform the data points into a uniform distribution, the t-KL method can yield an unbiased estimate. In reality, it is usually impossible to map the data points exactly to a uniform distribution and thus achieve an unbiased estimate. To this end, Theorem 1 suggests that, as long as the transformed samples are close to a uniform distribution in the sense that $C_2$ is small, the transformation can still significantly reduce the bias. Since the main contribution to the mean-square estimation error comes from the bias (as the variance decays at the rate of $O(N^{-1})$), reducing the bias therefore leads to a much more accurate estimate of the entropy.

We next consider the bias of the t-KSG estimator. The second theorem shows that the expectation of $\hat H_{t\text{-}KSG}$ has the same limiting behavior up to a polylogarithmic factor in $N$.

Theorem 2. Under Assumption 1 and for any finite $k$ and $d$, the bias of the truncated KSG estimator is bounded by

$$ \bigl| E[\hat H_{t\text{-}KSG}(X)] - H(X) \bigr| \le C\, \frac{(\log N)^{k+2}}{C_1^{k+1}}\, N^{-1/d} $$

for some $C > 0$, and the variance of the truncated KSG estimator is bounded by $\mathrm{Var}[\hat H_{t\text{-}KSG}(X)] \le C\, (\log N)^{k+2} N^{-1}$ for some $C > 0$.

Proof. See the SI.

As one can see from Theorem 2, while the uniform distribution leads to zero bias for $\hat H_{t\text{-}KL}$, we cannot obtain the same result for $\hat H_{t\text{-}KSG}$, which means there is no theoretical justification for mapping the data points toward a uniform distribution for this estimator. That said, the t-KSG estimator and Theorem 2 are still useful, for two reasons. First, as mentioned earlier, no existing result on the bias bound is available for the KSG estimator to the best of our knowledge, so our analysis of t-KSG provides the first known bias bound for this type of estimator and may help in understanding its convergence properties. More importantly, our numerical experiments demonstrate that mapping the data points toward a uniform distribution does significantly improve the performance of t-KSG as well. In fact, we have found that t-KSG achieves the same or slightly better results than t-KL on the transformed samples in our test cases.

Estimating Entropy via Transformation

As mentioned earlier, based on the convergence properties of the truncated estimators, in particular t-KL, we want to estimate the entropy of a given set of samples by mapping them toward a uniform distribution. To implement this idea, an essential question is how the entropy of the transformed samples relates to that of the original ones. Proposition 1 provides an answer.

Proposition 1 ((Ihara 1993)). Let $f$ be a mapping $\mathbb{R}^d \to \mathbb{R}^d$, let $X$ be a random variable on $\mathbb{R}^d$ with distribution $p_x$, and let $Z = f(X)$. If $f$ is bijective and differentiable, we have

$$ H(X) = H(Z) + \int p_z(z) \log \bigl|\det \nabla f^{-1}(z)\bigr|\, dz, \qquad (12) $$

where $p_z(z)$ is the distribution of $Z$.

Therefore, given a data set $S = \{x^{(i)}\}_{i=1}^N$ and a mapping $Z = f(X)$, from Eq. (12) we can construct an entropy estimator of $X$ as

$$ \hat H(X) = \hat H(Z) + \frac{1}{N}\sum_{i=1}^{N} \log \bigl|\det \nabla f^{-1}(z^{(i)})\bigr|, \qquad (13) $$

where $\hat H(Z)$ is an entropy estimator of $Z$ (either t-KL or t-KSG) based on the transformed samples $S_Z = \{z^{(i)} = f(x^{(i)})\}_{i=1}^N$. We refer to such a mapping $f(\cdot)$ as a uniformizing mapping (UM) and to the resulting method as a UM based entropy estimator. A central question in the implementation of the algorithm is obviously how to construct a UM which can push the samples toward a uniform distribution, which is discussed in the next section.
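As a small sanity check of Eqs. (12)-(13) with an exactly known uniformizing mapping, the snippet below estimates the entropy of a $d$-dimensional product of Exp(1) densities (true entropy $d$) by mapping each coordinate through its CDF $f_j(x_j) = 1 - e^{-x_j}$, for which $\log|\det \nabla f^{-1}(z^{(i)})| = \sum_j x^{(i)}_j$. The example distribution is our own choice for illustration, not one used in the paper, and it assumes the truncated_kl_entropy sketch above is in scope.

```python
import numpy as np

# product of d independent Exp(1) densities; true entropy H(X) = d
rng = np.random.default_rng(0)
d, n = 10, 20_000
x = rng.exponential(size=(n, d))

z = 1.0 - np.exp(-x)              # exact uniformizing map, applied coordinate-wise
log_det_inv = x.sum(axis=1)       # log|det grad f^{-1}(z^(i))| = sum_j x_j^(i)

# Eq. (13): entropy of X = entropy estimate of Z + average log-determinant correction
h_x = truncated_kl_entropy(z, k=3) + log_det_inv.mean()
print(h_x)                        # should be close to d = 10
```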
The bias of the UM based estimator depends on the properties of the UM (or equivalently the NF), on which we make the following assumption:

Assumption 2. Let $S = \{x^{(i)}\}_{i=1}^N$ be the set of i.i.d. samples used to construct the UM and let $p^S_z$ be the resulting density of $Z$ in Eq. (13). Denote $C^N_2 = \sup_{z \in Q^o} \|\nabla p^S_z(z)\|_1$, and assume that $C^N_2$ satisfies: (1) $C^N_2 \xrightarrow{P} 0$ as $N \to \infty$; (2) there exist a positive integer $M$ and a positive real number $\bar C < 1$ such that for all $N > M$, $C^N_2 \le \bar C$ a.s.

Based on Theorems 1 and 2, we can obtain a bound on the bias of the UM based estimator.

Corollary 2. Suppose that the density function of the original distribution is differentiable and the UM satisfies Assumption 2. The bias of the UM-t-KL estimator is bounded by

$$ \bigl| E[\hat H_{UM\text{-}t\text{-}KL}(X)] - H(X) \bigr| \le C^N_{UM\text{-}t\text{-}KL} \left(\frac{k}{N}\right)^{1/d}, \qquad (14) $$

where $\lim_{N\to\infty} C^N_{UM\text{-}t\text{-}KL} = 0$, and the bias of the UM-t-KSG estimator is bounded by

$$ \bigl| E[\hat H_{UM\text{-}t\text{-}KSG}(X)] - H(X) \bigr| \le C_{UM\text{-}t\text{-}KSG}\, \frac{(\log N)^{k+2}}{N^{1/d}}, \qquad (15) $$

where $C_{UM\text{-}t\text{-}KSG} = \dfrac{C\,(1+\bar C)(1+\bar C)^{d+1}}{(1-\bar C)^{k+1}}$ and $C$ is a positive constant.

Proof. See the SI.

Constructing UM via Normalizing Flow

We discuss in this section how to construct a UM via the NF method. First, since the image of $f$ is $[0, 1]^d$, we assume that $f$ is of the form $f = \Phi \circ g$, where $g: \mathbb{R}^d \to \mathbb{R}^d$ and $\Phi: \mathbb{R}^d \to [0, 1]^d$ is prescribed. Recall that $p_z$ is the distribution of $Z = f(X)$ with $X$ following $p_x$, and we seek the function $g$ by minimizing the Kullback-Leibler divergence (KLD) between $p_z$ and the uniform distribution $p_u$:

$$ \min_{g \in \Omega} D(p_z \| p_u) := \int p_z(z) \log \frac{p_z(z)}{p_u(z)}\, dz, \qquad (16) $$

where $z = \Phi \circ g(x)$ and $\Omega$ is a suitable function space. Solving Eq. (16) directly poses some computational difficulty, as the calculation involves the function $\Phi$, the choice of which may affect the computational efficiency. To simplify the computation, we recall the following proposition:

Proposition 2 ((Papamakarios et al. 2021)). Let $T: Y \to Z$ be a bijective and differentiable transformation, and let $p_z(z)$ and $\pi_z(z)$ be the distributions obtained by passing $p_y(y)$ and $\pi_y(y)$ through $T$. Then the equality

$$ D(\pi_y(y) \| p_y(y)) = D(\pi_z(z) \| p_z(z)) \qquad (17) $$

holds.

We now construct the mapping $\Phi$ with the cumulative distribution function of the standard normal distribution, a technique known as the probability integral transform, yielding, for a given $y \in \mathbb{R}^d$,

$$ \Phi(y) = (\phi_1(y_1), \ldots, \phi_d(y_d)), \qquad \phi_i(y_i) = \frac{1}{2}\left(1 + \mathrm{erf}\!\left(\frac{y_i}{\sqrt{2}}\right)\right), $$

where $\mathrm{erf}(\cdot)$ is the error function. It should be clear that if $y$ follows a standard normal distribution, $z = \Phi(y)$ follows a uniform distribution on $[0, 1]^d$, and vice versa. Now, applying Proposition 2, we can show that Eq. (16) is equivalent to

$$ \min_{g \in \Omega} D(p_y(y) \| q(y)), \qquad (18) $$

where $y = g(x)$ follows the distribution $p_y(\cdot)$ and $q(\cdot)$ is the standard normal density. Now assume that $g(\cdot)$ is invertible and let its inverse be $h = g^{-1}$. We also assume that both $g$ and $h$ are differentiable. Applying Proposition 2 to Eq. (18) with $T = h$, we find that Eq. (18) is equivalent to

$$ \min_{h \in \Omega^{-1}} D(p_x(x) \| q_h(x)), \qquad (19) $$

where $\Omega^{-1} = \{g^{-1} : g \in \Omega\}$ and $q_h$ is the distribution obtained by passing $q$ through the mapping $h$:

$$ q_h(x) = q\bigl(h^{-1}(x)\bigr)\, \bigl|\det \nabla h^{-1}(x)\bigr|. \qquad (20) $$

Eq. (19) essentially says that we want to push a standard normal distribution $q$ toward the target distribution $p_x$, and therefore solving Eq. (19) falls naturally into the framework of NF, the details of which are provided in the SI. Once the mapping $h(\cdot)$ (i.e., $g^{-1}(\cdot)$) is obtained, it can be inserted directly into Algorithm 1 to estimate the sought entropy. In practice, the samples are split into two sets, one of which is used to construct the UM while the other is used to estimate the entropy.
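The paper learns $g$ with a Masked Autoregressive Flow trained as described above. Purely for illustration, the sketch below runs the whole pipeline $f = \Phi \circ g$ with a much simpler stand-in for the learned flow, namely an affine whitening map fitted to the samples, followed by the probability integral transform $\Phi$ and the log-determinant correction of Eq. (13). This Gaussian surrogate is our simplification, not the authors' construction; it reuses the truncated_kl_entropy sketch from earlier, and it fits the map and estimates the entropy on the same samples, whereas the paper splits them into two sets as described above.

```python
import numpy as np
from scipy.stats import norm

def um_entropy_gaussian_flow(x, k=3):
    """UM-based entropy estimate with a fitted Gaussian whitening map standing in for the NF.

    g(x) = L^{-1}(x - mu) with Sigma = L L^T, Phi = standard normal CDF applied
    coordinate-wise, and Eq. (13) supplies the log-determinant correction.
    """
    n, d = x.shape
    mu = x.mean(axis=0)
    L = np.linalg.cholesky(np.cov(x, rowvar=False) + 1e-9 * np.eye(d))
    y = np.linalg.solve(L, (x - mu).T).T                 # y = g(x), roughly standard normal
    z = norm.cdf(y)                                      # z = Phi(y), pushed toward [0, 1]^d
    # log|det grad f(x)| = sum_j log phi(y_j) - log|det L|
    log_det_f = norm.logpdf(y).sum(axis=1) - np.log(np.diag(L)).sum()
    # H(X) = H(Z) - E[log|det grad f(X)|]
    return truncated_kl_entropy(z, k=k) - log_det_f.mean()
```

For a Gaussian target this surrogate map is already exact in distribution; for the non-Gaussian targets considered below it is only a crude substitute for the learned MAF.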
Figure 2: Left: RMSE plotted against the dimensionality d. Right: RMSE (on a logarithmic scale) plotted against the sample size N.

Numerical Experiments

Multivariate Normal Distribution

To validate the idea of the UM based entropy estimator, a natural question is how it performs with a perfect NF transformation, i.e., one that yields exactly normally distributed samples. To answer this question, we first conduct numerical tests with the standard multivariate normal distribution, corresponding to the situation in which a perfect NF has been obtained. Specifically, we test four methods, KL, KSG, UM-t-KL and UM-t-KSG, and we conduct two sets of tests: in the first we fix the sample size at 1000 and vary the dimensionality, while in the second we fix the dimensionality at 40 and vary the sample size. All tests are repeated 100 times and the root-mean-square error (RMSE) of the estimates is calculated. In Fig. 2 (left), we plot the RMSE (on a logarithmic scale) as a function of the dimensionality. One can see from this figure that, as the dimensionality increases, the estimation error of KL and KSG grows significantly faster than that of the two UM based estimators, with the error of KL being particularly large. Next, in Fig. 2 (right) we plot the RMSE against the sample size N (note that the plot is on a log-log scale) for d = 40, which shows that for this high-dimensional case the two UM based estimators yield much lower and faster-decaying RMSE than the two conventional estimators applied to the original samples. Overall these results support the theoretical finding that the estimation error can be significantly reduced by mapping the target samples toward a uniform distribution.

Multivariate Rosenbrock Distribution

In this example we examine how the proposed method performs when the NF is included. Specifically, our examples are of the Rosenbrock type: the standard Rosenbrock distribution is 2-D and widely used as a test case for a variety of statistical methods. Here we consider two high-dimensional extensions of the 2-D Rosenbrock distribution (Pagani, Wiegand, and Nadarajah 2019): the hybrid Rosenbrock (HR) and the even Rosenbrock (ER) distributions. The details of the two distributions, including their density functions, are provided in the SI. The Rosenbrock distribution is strongly non-Gaussian, as demonstrated by Fig. 3 (left), which shows samples drawn from the 2-D Rosenbrock distribution. As a comparison, Fig. 3 (right) shows the samples that have been transformed toward a uniform distribution and used in the entropy estimation.

Figure 3: Left: the original samples drawn from a 2-D Rosenbrock distribution; Right: the UM-transformed samples used in the entropy estimation.

In this example we compare the performance of seven estimators: in addition to the four used in the previous example, we include an estimator that uses only the NF (details in the SI) as well as two state-of-the-art entropy estimators: CADEE (Ariel and Louzoun 2020) and the von-Mises based estimator (Kandasamy et al. 2015). First we test how the estimators scale with respect to the dimensionality, where the sample size is taken to be N = 500d. For each method, the experiment is repeated 20 times and the RMSE is calculated. The RMSE against the dimensionality d for both test distributions is plotted in Figs. 4 (a) and (b).
One can observe that in most cases the UM based methods (especially UM-t-KSG) offer the best performance. An exception is that CADEE performs better in the low-dimensional cases for HR, but its RMSE grows much higher than that of the UM methods in the high-dimensional regime (d > 15). Our second experiment is to fix the dimensionality at d = 10 and vary the sample size; the RMSE is plotted against the sample size for both HR and ER in Figs. 4 (c) and (d). The figures show clearly that the RMSE of the UM based estimators decays faster than that of the other methods in both examples, with the only exception being CADEE in the small-sample regime of ER (sample sizes up to about $10^4$). It is also worth noting that, though it is not justified theoretically, UM-t-KSG seems to perform slightly better than UM-t-KL in all cases.

Figure 4: From left to right: RMSE vs. dimensionality for HR (a) and ER (b); RMSE vs. sample size for HR (c) and ER (d).

Table 1: The reference entropy values of the observation time placements obtained by using all the methods. The smallest (best) entropy value is shown in bold.
Methods: UM-t-KL, UM-t-KSG, CADEE, Equidistant, KL, KSG, NF, von-Mises
NMC: -1.45, -2.73, -1.65, -1.56, -1.48, -1.81
(SE): (0.0073), (0.0074), (0.0072), (0.0076), (0.0072), (0.0049)
RMSE: 0.73, 0.48, 0.86, 3.60, 1.05, 0.88, 1.31

Figure 5: Top: some sample data paths of (x, y); Bottom: the optimal observation times obtained by the eight methods.

Application to Optimal Experimental Design

In this section, we apply entropy estimation to an optimal experimental design (OED) problem. Simply put, the goal of OED is to determine the optimal experimental conditions (e.g., locations of sensors) that maximize a certain utility function associated with the experiments. Mathematically, let $\lambda \in D$ be the design parameters representing the experimental conditions, $\theta$ be the parameter of interest, and $Y$ be the observed data. An often used utility function is the entropy of the data $Y$, resulting in the so-called maximum entropy sampling (MES) method (Sebastiani and Wynn 2000):

$$ \max_{\lambda \in D} U(\lambda) := H(Y \mid \lambda), \qquad (21) $$

and therefore evaluating $U(\lambda)$ becomes an entropy estimation problem. This utility function is equivalent to the mutual entropy criterion under certain conditions (Shewry and Wynn 1987). This formulation is particularly useful for problems with expensive or intractable likelihoods, as the likelihoods are not needed if the utility function is computed via entropy estimation. A common application of OED is to determine the observation times for stochastic processes so that one can accurately estimate the model parameters, and here we provide such an example, arising from the field of population dynamics. Specifically, we consider the Lotka-Volterra (LV) predator-prey model (Lotka 1925; Volterra 1927). Let $x$ and $y$ be the populations of prey and predator respectively; the LV model is given by

$$ \dot x = ax - xy, \qquad \dot y = bxy - y, $$

where $a$ and $b$ are respectively the growth rates of the prey and the predator. In practice, the parameters $a$ and $b$ are often not known and need to be estimated from the population data. In a Bayesian framework, one can assign a prior distribution to $a$ and $b$ and infer them from measurements made on the population $(x, y)$. Here we assume that the prior for both $a$ and $b$ is a uniform distribution $U[0.5, 4]$. A sketch of how the observation data can be simulated for the utility evaluation in Eq. (21) is given below.
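As referenced above, the sketch below illustrates how the data $Y$ entering the utility $U(\lambda) = H(Y \mid \lambda)$ can be simulated: draw $(a, b)$ from the prior, integrate the LV model, and record noisy observations at a candidate set of observation times, to which one of the entropy estimators sketched earlier (here the Gaussian-surrogate um_entropy_gaussian_flow) can be applied. The initial condition and the candidate design below are illustrative placeholders; the noise level and sample size follow the observation setup described in the next paragraph.

```python
import numpy as np
from scipy.integrate import solve_ivp

def simulate_lv_data(design_times, n_samples=10_000, noise_std=0.1, s0=(1.0, 0.5), seed=0):
    """Simulate noisy LV observations Y for one design lambda = design_times.

    noise_std = 0.1 corresponds to the N(0, 0.01) observation noise described below;
    s0 is a placeholder initial condition.
    """
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(n_samples):
        a, b = rng.uniform(0.5, 4.0, size=2)          # draw (a, b) from the U[0.5, 4] prior
        sol = solve_ivp(lambda t, s: [a * s[0] - s[0] * s[1],
                                      b * s[0] * s[1] - s[1]],
                        (0.0, 10.0), s0, t_eval=design_times, rtol=1e-6)
        clean = sol.y.T.ravel()                       # (x(t_1), y(t_1), ..., x(t_d), y(t_d))
        data.append(clean + noise_std * rng.normal(size=clean.size))
    return np.asarray(data)

# U(lambda) = H(Y | lambda), estimated from the simulated data set
design = np.array([1.0, 3.0, 5.0, 7.0, 9.0])          # placeholder observation times
utility = um_entropy_gaussian_flow(simulate_lv_data(design))
```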
In particular, we assume that the pair $(x + \epsilon_x, y + \epsilon_y)$, where $\epsilon_x, \epsilon_y \sim N(0, 0.01)$ are independent observation noises, is measured at $d = 5$ time points located within the interval $[0, 10]$, and the goal is to determine the observation times for the experiments. As mentioned earlier, we determine the observation times using the MES method. Namely, the design parameter in this example is $\lambda = (t_1, \ldots, t_d)$, the data $Y$ is the pair $(x + \epsilon_x, y + \epsilon_y)$ measured at $t_1, \ldots, t_d$, and we want to find $\lambda$ that maximizes the entropy $H(Y \mid \lambda)$. A common practice in such problems is not to optimize the observation times directly but instead to parametrize them using the percentiles of a prescribed distribution, thereby reducing the optimization dimensionality (Ryan et al. 2014). Here we use a Beta distribution, resulting in two distribution parameters to be optimized (see (Ryan et al. 2014) and the SI for further details). We solve the resulting optimization problem with a grid search in which the entropy is evaluated by the seven aforementioned estimators, each with 10,000 samples.

We plot in Fig. 5 the optimal observation time placements computed with the seven estimators, as well as the equidistant placement for comparison. Also shown in the figure are some sample paths of the population $(x, y)$, where we can see that the population samples are generally subject to larger variations near the two ends of the interval and relatively smaller ones in the middle. Regarding the optimization results, we see that the optimal time placements obtained by the two UM based estimators and CADEE are the same, while they differ from the results of the other methods. To validate the optimization results, we compute a reference entropy value for the optimal placement obtained by each method, using nested Monte Carlo (NMC) (see (Ryan 2003) and the SI for details) with a large sample size ($10^5 \times 10^5$), and show the results in Table 1. Note that although NMC can produce a rather accurate entropy estimate, it is too expensive to use directly in this OED problem. Using the reference values as the ground truth, we can further compute the RMSE of these estimates (over 20 repetitions), which is also reported in Table 1. From the table one observes that the placement of observation times computed by the two UM methods and CADEE yields the largest entropy values, which indicates that these three methods clearly outperform all the other estimators in this OED problem. Moreover, from the RMSE results we can see that the UM based methods (especially UM-t-KSG) yield smaller RMSE than CADEE, suggesting that they are more statistically reliable.

In summary, we have presented an NF based entropy estimator, which is supported by both theoretical analysis and numerical experiments. We believe that the method can be useful in a wide range of real-world applications involving entropy estimation, for instance experimental design, and we plan to explore such applications in future studies.

Acknowledgments

This work was partially supported by the China Scholarship Council (CSC). The authors would also like to thank Dr. Alexander Kraskov for discussions about the KSG estimator.

References

Ao, Z.; and Li, J. 2020. An approximate KLD based experimental design for models with intractable likelihoods. In International Conference on Artificial Intelligence and Statistics, 3241-3251. PMLR.
Ariel, G.; and Louzoun, Y. 2020. Estimating differential entropy using recursive copula splitting. Entropy, 22(2): 236.
Azzi, S.; Sudret, B.; and Wiart, J. 2020.
Sensitivity analysis for stochastic simulators using differential entropy. International Journal for Uncertainty Quantification, 10(1).
Beirlant, J.; Dudewicz, E. J.; Györfi, L.; and Van der Meulen, E. C. 1997. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1): 17-39.
Berrett, T. B.; Samworth, R. J.; Yuan, M.; et al. 2019. Efficient multivariate entropy estimation via k-nearest neighbour distances. Annals of Statistics, 47(1): 288-318.
Biau, G.; and Devroye, L. 2015. Lectures on the nearest neighbor method, volume 246. Springer.
Birgé, L.; and Massart, P. 1995. Estimation of integral functionals of a density. The Annals of Statistics, 11-29.
Chen, W.-C.; Tareen, A.; and Kinney, J. B. 2018. Density estimation on small data sets. Physical Review Letters, 121(16): 160605.
Gao, S.; Ver Steeg, G.; and Galstyan, A. 2015. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, 277-286.
Gao, W.; Oh, S.; and Viswanath, P. 2018. Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory, 64(8): 5629-5661.
Goria, M. N.; Leonenko, N. N.; Mergel, V. V.; and Novi Inverardi, P. L. 2005. A new class of random vector entropy estimators and its applications in testing statistical hypotheses. Journal of Nonparametric Statistics, 17(3): 277-297.
Györfi, L.; and Van der Meulen, E. C. 1987. Density-free convergence properties of various estimators of entropy. Computational Statistics & Data Analysis, 5(4): 425-436.
Hall, P.; and Morton, S. C. 1993. On the estimation of entropy. Annals of the Institute of Statistical Mathematics, 45(1): 69-88.
Han, Y.; Jiao, J.; Weissman, T.; and Wu, Y. 2020. Optimal rates of entropy estimation over Lipschitz balls. The Annals of Statistics, 48(6): 3228-3250.
Ihara, S. 1993. Information theory for continuous systems, volume 2. World Scientific.
Joe, H. 1989. Estimation of entropy and other functionals of a multivariate density. Annals of the Institute of Statistical Mathematics, 41(4): 683-697.
Kandasamy, K.; Krishnamurthy, A.; Poczos, B.; Wasserman, L. A.; and Robins, J. M. 2015. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In NIPS, volume 15, 397-405.
Kozachenko, L.; and Leonenko, N. N. 1987. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2): 9-16.
Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E, 69(6): 066138.
Krishnamurthy, A.; Kandasamy, K.; Poczos, B.; and Wasserman, L. 2014. Nonparametric estimation of Renyi divergence and friends. In International Conference on Machine Learning, 919-927. PMLR.
Lotka, A. J. 1925. Elements of physical biology. Williams & Wilkins.
Miller, E. G. 2003. A new class of entropy estimators for multi-dimensional densities. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), volume 3, III-297. IEEE.
Pagani, F.; Wiegand, M.; and Nadarajah, S. 2019. An n-dimensional Rosenbrock distribution for MCMC testing. arXiv preprint arXiv:1903.09556.
Papamakarios, G.; Nalisnick, E.; Rezende, D. J.; Mohamed, S.; and Lakshminarayanan, B. 2021. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57): 1-64.
Papamakarios, G.; Pavlakou, T.; and Murray, I. 2017. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2338-2347.
Ranneby, B. 1984. The maximum spacing method. An estimation method related to the maximum likelihood method. Scandinavian Journal of Statistics, 93-112.
Rezende, D.; and Mohamed, S. 2015. Variational inference with normalizing flows. In International Conference on Machine Learning, 1530-1538. PMLR.
Ryan, E. G.; Drovandi, C. C.; Thompson, M. H.; and Pettitt, A. N. 2014. Towards Bayesian experimental design for nonlinear models that require a large number of sampling times. Computational Statistics & Data Analysis, 70: 45-60.
Ryan, K. J. 2003. Estimating expected information gains for experimental designs with application to the random fatigue-limit model. Journal of Computational and Graphical Statistics, 12(3): 585-603.
Sebastiani, P.; and Wynn, H. P. 2000. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1): 145-157.
Shewry, M. C.; and Wynn, H. P. 1987. Maximum entropy sampling. Journal of Applied Statistics, 14(2): 165-170.
Singh, H.; Misra, N.; Hnizdo, V.; Fedorowicz, A.; and Demchuk, E. 2003. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23(3-4): 301-321.
Singh, S.; and Póczos, B. 2016. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, 1217-1225.
Sricharan, K.; Wei, D.; and Hero, A. O. 2013. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory, 59(7): 4374-4388.
Tsybakov, A. B.; and Van der Meulen, E. 1996. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian Journal of Statistics, 75-83.
Vasicek, O. 1976. A test for normality based on sample entropy. Journal of the Royal Statistical Society: Series B (Methodological), 38(1): 54-59.
Volterra, V. 1927. Variazioni e fluttuazioni del numero d'individui in specie animali conviventi. C. Ferrari.
Wolsztynski, E.; Thierry, E.; and Pronzato, L. 2005. Minimum-entropy estimation in semi-parametric models. Signal Processing, 85(5): 937-949.