# Boosted Density Estimation Remastered

Zac Cranko (The Australian National University; Data61) and Richard Nock (Data61; The Australian National University; The University of Sydney). Correspondence to: Zac Cranko.

Abstract. There has recently been a steady increase in the number of iterative approaches to density estimation. However, an accompanying burst of formal convergence guarantees has not followed; all results pay the price of heavy assumptions which are often unrealistic or hard to check. The Generative Adversarial Network (GAN) literature, seemingly orthogonal to the aforementioned pursuit, has had the side effect of renewing interest in variational divergence minimisation (notably f-GAN). We show how to combine this latter approach with classical boosting theory from supervised learning to obtain the first density estimation algorithm that provably achieves geometric convergence under very weak assumptions. We do so by a trick that combines classifiers as the sufficient statistics of an exponential family. Our analysis includes an improved variational characterisation of f-GAN.

1. Introduction

In the emerging area of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), a binary classifier, called a discriminator, is used to learn a highly efficient sampler for a data distribution P, combining into one what would traditionally be two steps: first learning a density function from a family of densities, then fine-tuning a sampler. Interest in this field has sparked a series of formal inquiries and generalisations describing GANs in terms of (among other things) divergence minimisation (Nowozin et al., 2016; Arjovsky et al., 2017). Using a framework similar to Nowozin et al. (2016), Grover & Ermon (2018) make a preliminary analysis of an algorithm that uses a series of iteratively trained discriminators to estimate a density function (Grover & Ermon (2018) call this procedure multiplicative discriminative boosting). The cost here, insofar as we have been able to devise, is that one forgoes learning an efficient sampler (as with a GAN), and must make do with classical sampling techniques to sample from the learned density. We leave the issue of efficient sampling from these densities as an open problem, and instead focus on analysing the learned densities with formal convergence guarantees. Previous formal results have established a range of guarantees, from qualitative convergence (Grover & Ermon, 2018) to geometric convergence rates (Tolstikhin et al., 2017), with numerous results in between (see §1.1). Our starting point is fundamentally different: we learn a density from a sequence of binary classifiers. Under a weak learning assumption similar to that of classical boosting, this sequence is shown to fit the target density arbitrarily closely.
With the advent of deep learning, such an approach appears very promising, as the formal bounds we obtain yield geometric convergence under assumptions arguably much weaker than those of other similar works in this area.

The rest of the paper and our contributions are as follows: in §2, to make explicit the connections between classification, density estimation, and divergence minimisation, we re-introduce the variational f-divergence formulation, and in doing so are able to fully explain some of the underspecified components of f-GAN (Nowozin et al., 2016); in §3, we relax a number of the assumptions of Grover & Ermon (2018), and then give both more general and much stronger bounds for their algorithm; in §4, we apply our algorithm to several toy datasets in order to demonstrate convergence and compare directly with Tolstikhin et al. (2017); and finally, §5 concludes. The appendices in the supplementary material are: §A, a comparison of our formal results with other related works; §B, a geometric account of the function class in the variational form of an f-divergence; §C, a further relaxation of the weak learning assumptions to ones that could actually be estimated experimentally, together with a proof that the resulting boosting rates are slightly worse but of essentially the same order; §D, proofs of the main formal results of the paper; and finally, §E, technical details of the settings of our experiments.

Table 1: Summary of related works and results in terms of (i) the components aggregated ("updates") in the final density (this density can be implicit), (ii) the rate to reach a given KL/JS value and (iii) the assumption(s).

Approach | Updates | Rate | Assumption
Dai et al., 2016 | kernel density estimate / particles | O(KL(P, Q0)) | smoothness, Lipschitz, measure concentration, etc.
Tolstikhin et al., 2017 | density | O(log JS(P, Q0)) | updates close to optimal
Grover & Ermon, 2018 | density | none | none
This work | binary classifiers | O(log KL(P, Q0)) | weak learning assumption on classifiers, weak dominance

1.1. Related work

In learning a density function iteratively, it is remarkable that most previous approaches (Guo et al., 2016; Li & Barron, 2000; Miller et al., 2017; Rosset & Segal, 2002; Tolstikhin et al., 2017; Zhang, 2003, and references therein) have investigated a single update rule, not unlike Frank-Wolfe optimisation:

    $q_t = h\big(\alpha_t\, j(d_t) + (1 - \alpha_t)\, j(q_{t-1})\big)$,    (1)

wherein $h, j$ are transformations that are in general (but not always) the identity, $(q_t)_{t \in \mathbb{N}}$ is the sequence of density functions learnt, and $(\alpha_t, d_t)_{t \in \mathbb{N}}$ are the step sizes and updates. The updates and step sizes are chosen so that, for some measure of divergence, $\mathrm{I}(P, Q_t) \to 0$ as $t \to \infty$, where $\mathrm{I}(P, Q_t)$ is the divergence of the true distribution P from $Q_t$. Grover & Ermon (2018) is one (recent) rare exception to (1) wherein alternative choices are explored. Few works in this area are accompanied by convergence proofs, and even fewer provide convergence rates (Guo et al., 2016; Li & Barron, 2000; Rosset & Segal, 2002; Tolstikhin et al., 2017; Zhang, 2003). To establish convergence and/or bound convergence by a rate, all approaches necessarily make structural assumptions or approximations on the parameters involved in (1).
These assumptions can be on the (local) variation of the divergence (Guo et al., 2016; Naito & Eguchi, 2013; Zhang, 2003), on the true distribution or the updates (Dai et al., 2016; Grover & Ermon, 2018; Guo et al., 2016; Li & Barron, 2000), on the step size (Miller et al., 2017; Tolstikhin et al., 2017), on the previous updates $(d_i)_{i \leq t}$ (Dai et al., 2016; Rosset & Segal, 2002), and so on. In order to produce the best geometric convergence bounds, the update is usually required to be close to the optimal one (Tolstikhin et al., 2017, Cor. 2, 3). Table 1 compares the best results of the three leading approaches to ours. For each we give the updates aggregated, the assumptions on which the results rely, and the rate to come close to a fixed value of KL divergence (Jensen-Shannon, JS, for Tolstikhin et al. (2017)), which is just the order of the number of iterations necessary, hiding all other dependences for simplicity. It must be kept in mind, however, that for many of these works (viz. Tolstikhin et al., 2017) the primary objective is to develop an efficient black-box sampler for P, in particular in large dimensions. Our objective is instead to address the lack of formal results on the densities and their convergence, leaving the problem of sampling from these densities as an open question.

2. Preliminaries

In the sequel X is a topological space. Unnormalised Borel measures on X are indicated by decorated capital letters, $\tilde{P}$, and Borel probability measures by capital letters without decoration, P. To a function $f : X \to (-\infty, +\infty]$ we associate another function $f^*$, called the Fenchel conjugate, with $f^*(x^*) := \sup_{x \in X} \langle x^*, x \rangle - f(x)$. If f is convex, proper, and lower semi-continuous, then $f = (f^*)^*$. If f is strictly convex and differentiable on $\mathrm{int}(\mathrm{dom}\, f)$ then $(f^*)' = (f')^{-1}$. Theorem-like formal statements are numbered to be consistent with their appearance in the appendix (§D), to which we defer all proofs.

An important tool of ours are the f-divergences of information theory (Ali & Silvey, 1966; Csiszár, 1967). The f-divergence of P from Q is

    $\mathrm{I}_f(P, Q) := \int f\!\left(\tfrac{dP}{dQ}\right) dQ$,

where it is assumed that $f : \mathbb{R} \to (-\infty, +\infty]$ is convex and lower semi-continuous, and Q dominates P. (Common divergence measures such as the Kullback-Leibler (KL) divergence and total variation can easily be shown to be members of this family by picking f accordingly (Reid & Williamson, 2011); several examples are listed in Table 2.) Every f-divergence has a variational representation (Reid & Williamson, 2011) via the Fenchel conjugate:

    $\mathrm{I}_f(P, Q) = \sup_{u \in (\mathrm{dom}\, f^*)^X} \mathbb{E}_P[u] - \mathbb{E}_Q[f^* \circ u]$,    (2)

where the supremum is implicitly restricted to measurable functions.

Table 2: Some common f-divergences and their variational components.

$\mathrm{I}_f$ | $f(t)$ | $f^*(t^*)$ | $f'(t)$ | $(f^* \circ f')(t)$
Kullback-Leibler, KL | $t \log t$ | $\exp(t^* - 1)$ | $\log t + 1$ | $t$
Reverse KL, rKL | $-\log t$ | $-\log(-t^*) - 1$ | $-1/t$ | $\log t - 1$
Hellinger | $(\sqrt{t} - 1)^2$ | $t^*/(1 - t^*)$ | $1 - 1/\sqrt{t}$ | $\sqrt{t} - 1$
Pearson $\chi^2$ | $(t - 1)^2$ | $t^*(4 + t^*)/4$ | $2(t - 1)$ | $t^2 - 1$
GAN | $t \log t - (t + 1)\log(t + 1)$ | $-\log(1 - \exp(t^*))$ | $\log\frac{t}{t+1}$ | $\log(1 + t)$

In contrast to the abstract family $(\mathrm{dom}\, f^*)^X$, binary classification models tend to be specified in terms of density ratios, binary conditional distributions, and binary classifiers; these are, respectively,

    $\mathrm{R}(X) := \{d : X \to (0, \infty)\}$,  $\mathrm{D}(X) := \{D : X \to (0, 1)\}$,  $\mathrm{C}(X) := \{c : X \to \mathbb{R}\}$.

(While it might seem that there are certain inclusions here, for example $\mathrm{D}(X) \subseteq \mathrm{R}(X) \subseteq \mathrm{C}(X)$, these categories of functions really are distinct objects when thought of with respect to their corresponding binary classification decision rules, listed in Table 3.)

Table 3: Classification decision rules.

Collection | $\mathrm{R}(X)$ | $\mathrm{D}(X)$ | $\mathrm{C}(X)$
Decision rule | $d \geq 1$ | $D \geq 1 - D$ | $c \geq 0$
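Before relating these three classes of models, the variational representation (2) can be made concrete with a small numerical sketch (illustrative Python, not the paper's Julia implementation). For the KL row of Table 2, choosing $u = f' \circ d = \log d + 1$ in (2) gives the lower bound $\mathbb{E}_P[\log d] + 1 - \mathbb{E}_Q[d] \leq \mathrm{KL}(P, Q)$, tight at $d = dP/dQ$. The Gaussian example and both candidate ratios below are hypothetical choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d example: P = N(1, 1), Q = N(0, 1); the true ratio dP/dQ is known in closed form.
xp = rng.normal(1.0, 1.0, size=100_000)   # samples from P
xq = rng.normal(0.0, 1.0, size=100_000)   # samples from Q

def kl_lower_bound(d, xp, xq):
    """Variational bound E_P[f'(d)] - E_Q[(f* o f')(d)] for f(t) = t log t,
    i.e. E_P[log d + 1] - E_Q[d] <= KL(P, Q), with equality at d = dP/dQ."""
    return np.mean(np.log(d(xp)) + 1.0) - np.mean(d(xq))

true_ratio = lambda x: np.exp(x - 0.5)           # dP/dQ for the two Gaussians above
crude_ratio = lambda x: np.exp(0.5 * (x - 0.5))  # a deliberately weaker candidate

print(kl_lower_bound(true_ratio, xp, xq))   # ~0.5 = KL(N(1,1), N(0,1))
print(kl_lower_bound(crude_ratio, xp, xq))  # strictly smaller, as (2) predicts
```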
It is easy to see that these sets are equivalent, via the commonly used connections

    $\varphi(D) := \frac{D}{1 - D}$,  $\sigma(c) := \frac{1}{1 + \exp(-c)}$,  $\varphi \circ \sigma = \exp$,

which are illustrated in Figure 4.

Figure 4: Bijections for reparameterising (V), relating $(\mathrm{int}\,\mathrm{dom}\, f^*)^X$, $\mathrm{R}(X)$ and $\mathrm{C}(X)$.

It is a common result (Nguyen et al., 2010; Grover & Ermon, 2018; Nowozin et al., 2016) that the supremum in (2) is achieved for $f' \circ \frac{dP}{dQ}$. It is convenient to define the reparameterised variational problem

    $\mathrm{I}_f(P, Q; \mathcal{F}) := \sup_{d \in \mathcal{F}} \mathbb{E}_P[f' \circ d] - \mathbb{E}_Q[(f^* \circ f') \circ d]$,    (V)

where $\mathcal{F} \subseteq \mathrm{R}(X)$, so that the unconstrained maximum is achieved for $d = \frac{dP}{dQ}$. See §B for more details.

Example 1 (Neural classifier). A neural network with a softmax layer corresponds to an element $D \in \mathrm{D}(X)$. To convert it to an element $d \in \mathrm{R}(X)$, simply substitute the softmax with an exponential activation function. This is just the arrow $\mathrm{C}(X) \to \mathrm{R}(X)$ in Figure 4.

Example 2 (f-GAN). The GAN objective (Goodfellow et al., 2014) implicitly solves the reparameterised variational problem (V):

    $\sup_{D \in \mathrm{D}(X)} \big(\mathbb{E}_P[\log D] + \mathbb{E}_Q[\log(1 - D)]\big) = \sup_{D \in \mathrm{D}(X)} \big(\mathbb{E}_P[(f' \circ \varphi) \circ D] - \mathbb{E}_Q[(f^* \circ f' \circ \varphi) \circ D]\big) = \mathrm{I}_{\mathrm{GAN}}(P, Q)$,

where the function f is the GAN entry of Table 2, corresponding to the GAN f-divergence. From our derivation it is clear that (V), together with the isomorphisms in Figure 4, gives a simple, principled choice for the output activation function, $g_f$, of Nowozin et al. (2016).

3. Boosted density estimation

We fit distributions $Q_t$ over the space X incrementally using the update

    $\tilde{Q}_t(dx) = d_t^{\alpha_t}(x)\, Q_{t-1}(dx)$,  $Q_t := \tfrac{1}{Z_t}\,\tilde{Q}_t$,    (3)

where $Z_t := \tilde{Q}_t(X)$ is the normalisation factor, $\alpha_t \in [0, 1]$ is the step size (for reasons that will be clear shortly), $d_t : X \to \mathbb{R}_+$ is a density ratio, and we fix the initial distribution $Q_0$. After t updates we have the distribution

    $Q_t(dx) := \frac{1}{Z_t} \prod_{i=1}^{t} d_i^{\alpha_i}(x)\, Q_0(dx)$, with $Z_t = \int \prod_{i=1}^{t} d_i^{\alpha_i}(x)\, Q_0(dx)$.    (4)

Proposition 2. The normalisation factors can be written recursively as $Z_t = Z_{t-1}\, \mathbb{E}_{Q_{t-1}}\big[d_t^{\alpha_t}\big]$.

Proposition 3. Let $Q_t$ be defined via (4) with a sequence of binary classifiers $c_1, \dots, c_t \in \mathrm{C}(X)$, where $c_i = \log d_i$ for $i \in [t]$. Then $Q_t$ is an exponential family distribution with natural parameter $\alpha := (\alpha_1, \dots, \alpha_t)$ and sufficient statistic $c(x) := (c_1(x), \dots, c_t(x))$.

The sufficient statistics of our distributions are thus classifiers, which are learned along with the appropriate fitting of natural parameters. As explained in the proof, the representation may not be minimal; however, without further constraints on $\alpha$, the exponential family is regular (Barndorff-Nielsen, 1978). A similar interpretation of a neural network as parameterising the sufficient statistics of a deformed exponential family is given by Nock et al. (2017). In the remainder of this section, we show how to learn the density ratios $d_i$ and choose the step sizes $\alpha_i$ from observed data so as to ensure convergence $Q_t \to_t P$.

3.1. Helper results for the convergence of $Q_t$ to P

The updates $d_t$ are learnt by solving (V) with Q replaced by $Q_{t-1}$. To make explicit the dependence of $Q_t$ on $\alpha_t$ we will sometimes write $d\tilde{Q}_{t|\alpha_t} := d_t^{\alpha_t}\, dQ_{t-1}$ and $dQ_{t|\alpha_t} := \frac{1}{Z_t}\, d\tilde{Q}_{t|\alpha_t}$. Since $Q_t$ is an exponential family (Proposition 3), we measure the divergence between P and $Q_t$ using the KL divergence (Table 2), which is the canonical divergence of exponential families (Amari & Nagaoka, 2000). Notice that we can write any solution to (V) as $d_t = \frac{dP}{dQ_{t-1}} \cdot \varepsilon_t$, where we call $\varepsilon_t : X \to \mathbb{R}_+$ the error term, because it is determined by the difference between the constrained and global solutions to (V). A more detailed analysis of the quantity $\varepsilon_t$ is presented in §B.
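Before turning to the convergence analysis, Proposition 3 suggests a very direct representation of $Q_t$: its log-density is $\log q_0$ plus a weighted sum of the classifiers, and (4) lets us estimate the normaliser $Z_T = \mathbb{E}_{Q_0}[\exp(\sum_i \alpha_i c_i)]$ by Monte Carlo over $Q_0$. A minimal Python sketch follows (not the paper's Julia implementation); the two classifiers are hypothetical, bounded stand-ins for trained networks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for trained classifiers c_i : R^2 -> R (Proposition 3's
# sufficient statistics); bounded, as Section 3 assumes. In the paper these are
# small neural networks.
classifiers = [
    lambda x: np.tanh(x[:, 0] - 0.5 * x[:, 1]),
    lambda x: np.tanh(x[:, 0] + x[:, 1]),
]
alphas = [0.5, 0.5]                                              # step sizes (natural parameters)
q0 = stats.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))   # initial Q_0

def unnormalised_log_q(x):
    """log of the density in (4), up to the constant -log Z_T:
    log q_0(x) + sum_i alpha_i * c_i(x), evaluated on rows of x (shape (n, 2))."""
    return q0.logpdf(x) + sum(a * c(x) for a, c in zip(alphas, classifiers))

# Z_T = E_{Q_0}[ prod_i d_i(x)^{alpha_i} ] = E_{Q_0}[ exp(sum_i alpha_i c_i(x)) ],
# estimated here by plain Monte Carlo over Q_0 samples.
x0 = q0.rvs(size=200_000, random_state=rng)
log_Z = np.log(np.mean(np.exp(sum(a * c(x0) for a, c in zip(alphas, classifiers)))))

def log_q(x):
    return unnormalised_log_q(x) - log_Z   # normalised log-density of Q_T

print(log_q(np.array([[0.0, 0.0], [1.0, -1.0]])))
```

The unnormalised log-density alone is also all that a standard MCMC sampler needs, a point we return to in §4.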
Lemma 5. For any $\alpha_t \in [0, 1]$, letting $Q_t, Q_{t-1}$ be as in (3), we have

    $\mathrm{KL}(P, Q_{t|\alpha_t}) \leq (1 - \alpha_t)\,\mathrm{KL}(P, Q_{t-1}) + \alpha_t\big(\log \mathbb{E}_P[\varepsilon_t] - \mathbb{E}_P[\log \varepsilon_t]\big)$    (5)

for all $d_t \in \mathrm{R}(X)$, where $\varepsilon_t := \big(\tfrac{dP}{dQ_{t-1}}\big)^{-1} d_t$.

Remark 3. Grover & Ermon (2018) assume a uniform error term, $\varepsilon_t \equiv 1$. In this case Lemma 5 yields geometric convergence: $\forall \alpha_t \in [0, 1]: \mathrm{KL}(P, Q_{t|\alpha_t}) \leq (1 - \alpha_t)\,\mathrm{KL}(P, Q_{t-1})$. This result is significantly stronger than Grover & Ermon (2018, Thm. 2), who only show non-increase of the KL divergence. If, in addition to achieving uniform error, we let $\alpha_t = 1$, then (5) guarantees $Q_{t|\alpha_t=1} = P$.

We can express the update (4) and the bound (5) in a way that more closely resembles the Frank-Wolfe update (1). Since $\varepsilon_t$ takes positive values, we can identify it with a density ratio involving a (not necessarily normalised) measure $\tilde{R}_t$, defined by

    $\tilde{R}_t(dx) := \varepsilon_t(x)\, P(dx)$,    (6)

with $R_t$ its normalisation. Introducing $R_t$ allows us to lend some interpretation to Lemma 5 in terms of the probability measure $R_t$. If we assume X is a measure space and that all measures have positive density functions (denoted by lowercase letters) with respect to the ambient measure, then $q_t \propto r_t^{\alpha_t}\, q_{t-1}^{1-\alpha_t}$, or equivalently

    $\exists\, C > 0 : \log q_t = \alpha_t \log r_t + (1 - \alpha_t) \log q_{t-1} - C$.

This shows the manner in which (3) is a special case of (1).

Corollary 6. For any $\alpha_t \in [0, 1]$ and $\varepsilon_t \in [0, +\infty)^X$, let $Q_t$ be as in (4) and $R_t$ as in (6). If $R_t$ satisfies

    $\mathrm{KL}(P, R_t) \leq \gamma\, \mathrm{KL}(P, Q_{t-1})$    (7)

for some $\gamma \in [0, 1]$, then $\mathrm{KL}(P, Q_{t|\alpha_t}) \leq \big(1 - \alpha_t(1 - \gamma)\big)\, \mathrm{KL}(P, Q_{t-1})$.

We obtain the same geometric convergence as Tolstikhin et al. (2017, Cor. 2) for a boosted distribution $Q_t$ which is not a convex mixture, which, to our knowledge, is a new result. Corollary 6 is restricted to the KL divergence, but we do not need the technical domination assumption that Tolstikhin et al. (2017, Cor. 2) require. From the standpoint of weak versus strong learning, Tolstikhin et al. (2017, Cor. 2) require a condition similar to (7): the iterate $R_t$ has to be close enough to P. It is the objective of the following sections to relax this requirement to something akin to the weak updates common in a boosting scheme.

3.2. Convergence under weak assumptions

In the previous section we established two preliminary convergence results (Remark 3, Corollary 6) that equal the state of the art and/or rely on similarly strong assumptions. We now show how to relax these in favour of placing some weak conditions on the binary classifiers learnt in (2). Define the two expected edges of $c_t$ (cf. Nock & Nielsen, 2008):

    $\mu_{Q_{t-1}} := \tfrac{1}{c^\star}\, \mathbb{E}_{Q_{t-1}}[-c_t]$ and $\mu_P := \tfrac{1}{c^\star}\, \mathbb{E}_P[c_t]$,    (8)

where $c^\star := \max_t \operatorname{ess\,sup} |c_t|$ and the maximum is implicitly over all previous iterations. Classical boosting results rely on assumptions on such edges for different kinds of $c_t$ (Freund & Schapire, 1997; Schapire, 1990; Schapire & Singer, 1999), together with the implicit and weak assumption, which we also make, that $0 < c^\star < \infty$, that is, the classifiers have bounded and nonzero confidence. By construction then, $\mu_{Q_{t-1}}, \mu_P \in [-1, 1]$. The difference in the sign of $c_t$ between the two edges is due to the decision rule for a binary classifier (Table 3), whereby $c_t(x) \geq 0$ reflects that $c_t$ classifies $x \in X$ as originating from P rather than $Q_{t-1}$, and vice versa for $c_t(x) < 0$.
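The edges in (8) are straightforward to estimate from samples, and the weak learning assumption below (WLA) is stated directly in terms of them, so it can be checked empirically (a further relaxation along these lines is discussed in §C). A small Python sketch follows; the samples and classifier are hypothetical, and $c^\star$ is replaced by the largest observed $|c_t|$ as an empirical stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical samples: xp ~ P, xq ~ Q_{t-1}, and a bounded classifier c_t.
xp = rng.normal(1.0, 1.0, size=50_000)
xq = rng.normal(-1.0, 1.0, size=50_000)
c_t = np.tanh  # classifies x as "from P" when c_t(x) >= 0 (Table 3)

c_star = max(np.abs(c_t(xp)).max(), np.abs(c_t(xq)).max())  # empirical stand-in for c*

mu_P = np.mean(c_t(xp)) / c_star    # edge of c_t on P, cf. (8)
mu_Q = np.mean(-c_t(xq)) / c_star   # edge of c_t on Q_{t-1}, cf. (8)

gamma_P, gamma_Q = 0.1, 0.1         # illustrative thresholds for the assumption below
print(mu_P, mu_Q, mu_P >= gamma_P and mu_Q >= gamma_Q)  # both edges positive here
```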
Figure 5: Illustration of WLA in one dimension with a classifier $c_t$ and its decision rule (indicated by the dashed grey line). The red (resp. blue) area is the area under the $\frac{c_t}{c^\star}\, p$ (resp. $-\frac{c_t}{c^\star}\, q_{t-1}$) curve, where $p, q_{t-1}$ are the corresponding density functions, that is, $\mu_P$ (resp. $\mu_{Q_{t-1}}$). (a) WLA is satisfied since $\mu_P$ and $\mu_{Q_{t-1}}$ are positive. (b) WLA is violated since $\mu_{Q_{t-1}}$ is negative.

Assumption 1 (Weak learning assumption).

    $\exists\, \gamma_P, \gamma_Q \in (0, 1] : \mu_P \geq \gamma_P,\ \mu_{Q_{t-1}} \geq \gamma_Q$.    (WLA)

The weak learning assumption is in effect a separation condition on P and $Q_{t-1}$: the decision boundary associated with $c_t$ correctly divides most of the mass of P and most of the mass of $Q_{t-1}$. This is illustrated in Figure 5. Note that if $Q_{t-1}$ has converged to P, the weak learning assumption cannot hold. This is reasonable, since as $Q_{t-1} \to P$ it becomes harder to build a classifier that tells them apart. We note that classical boosting would rely on a single inequality for the weak learning assumption (involving the two edges) (Schapire & Singer, 1999) instead of the two in WLA. The difference is, however, superficial, as we can show that both assumptions are equivalent (Lemma 7 in §D).

A boosting algorithm would ensure, for any given error $\varrho > 0$, that there exists a number of iterations T for which we do have $\mathrm{KL}(P, Q_T) \leq \varrho$, where T is required to be polynomial in all relevant parameters, in particular $1/\gamma_P$, $1/\gamma_Q$, $c^\star$, $\mathrm{KL}(P, Q_0)$. Notice that we have to put $\mathrm{KL}(P, Q_0)$ in the complexity requirement since it can be arbitrarily large compared to the other parameters.

Theorem 18. Suppose WLA holds at each iteration. Then, using $Q_t$ as in (4) and $\alpha_t$ as in (9), we are guaranteed that $\mathrm{KL}(P, Q_T) \leq \varrho$ after a number of iterations T that grows linearly in $\mathrm{KL}(P, Q_0)/\varrho$, with a factor polynomial in $1/\gamma_P$, $1/\gamma_Q$ and $c^\star$.

There is more to boosting: the question naturally arises as to whether faster convergence is possible. A simple observation allows us to conclude that it should require more than just WLA. Define $\mu_{\varepsilon_t} := \frac{1}{c^\star}\, \mathbb{E}_P[\log \varepsilon_t]$, the normalised expected log-density estimation error. Then we have $\mu_P = \frac{1}{c^\star}\,\mathrm{KL}(P, Q_{t-1}) + \mu_{\varepsilon_t}$, so controlling $\mu_P$ does not give substantial leverage on $\mathrm{KL}(P, Q_t)$ because of the unknown $\mu_{\varepsilon_t}$. We show that an additional weak assumption on $\mu_{\varepsilon_t}$ is all that is needed, together with WLA, to obtain convergence rates that compete with Tolstikhin et al. (2017, Lem. 2) while using much weaker assumptions. We call this assumption the Weak Dominance Assumption (WDA).

Assumption 2 (Weak Dominance Assumption).

    $\exists\, \Gamma_\varepsilon > 0,\ \forall t \geq 1 : \mu_{\varepsilon_t} \geq -\Gamma_\varepsilon$.    (WDA)

The assumption WDA takes its name from the observation that we have $c_t = \log d_t = \log\big(\frac{dP}{dQ_{t-1}}\, \varepsilon_t\big)$ and $|c_t| \leq c^\star$, so by ensuring that $\varepsilon_t$ is non-zero P-almost everywhere, WDA states that nowhere in the support is $Q_{t-1}$ negligible with respect to P. This also looks like a weak, finite form of absolute continuity of P with respect to $Q_{t-1}$, not unlike the boundedness condition on the log-density ratio of Li & Barron (2000, Thm. 1). Provided WLA and WDA hold at each iteration, Theorem 19 yields geometric boosting convergence rates.

Theorem 19. Suppose WLA and WDA hold at each boosting iteration. Then after T boosting iterations:

    $\mathrm{KL}(P, Q_T) \leq \Big(1 - \frac{\gamma_P \min\{2,\, \gamma_Q/c^\star\}}{2(1 + \Gamma_\varepsilon)}\Big)^{T}\, \mathrm{KL}(P, Q_0)$.

Note that the bound obtained in Theorem 19 is, in fact, logarithmic in $\mathrm{KL}(P, Q_0)$; that is, we have $\mathrm{KL}(P, Q_T) \leq \varrho$ when

    $T \geq \frac{2(1 + \Gamma_\varepsilon)}{\gamma_P \min\{2,\, \gamma_Q/c^\star\}}\, \log \frac{\mathrm{KL}(P, Q_0)}{\varrho}$.

4. Experiments

Let $t \in \{0, \dots, T\}$. Our experiments mostly take place in $X := \mathbb{R}^2$, where we use a simple neural network classifier $c_t \in \mathrm{C}(X)$, which we train using the cross entropy error by post-composing it with the logistic sigmoid, $\sigma \circ c_t$. After training $c_t$ we transform it into a density ratio using an exponential function, $d_t := \exp \circ\, c_t$ (cf. §2), which we use to update $Q_{t-1}$. In most experiments we train for T > 1 rounds and therefore need to sample from $Q_{t-1}$ (it is easy to pick $Q_0$ to be convenient to sample from). Our setting here is simple, and so this is easily accomplished using random-walk Metropolis-Hastings.
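Since $Q_t$ is only known up to the normaliser $Z_t$, a sampler that needs nothing more than an unnormalised log-density suffices. A minimal random-walk Metropolis-Hastings sketch in Python follows; `unnormalised_log_q` is any function computing $\log q_0 + \sum_i \alpha_i c_i$ over rows of its input, as in the sketch of §3, and the proposal scale, chain length and seed are arbitrary illustrative choices rather than the settings used in the experiments below.

```python
import numpy as np

def rw_metropolis_hastings(unnormalised_log_q, x0, n_steps=10_000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings targeting exp(unnormalised_log_q);
    the normaliser Z_t cancels in the acceptance ratio, so it is never needed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    log_q_x = unnormalised_log_q(x[None, :])[0]
    samples = np.empty((n_steps, x.size))
    for i in range(n_steps):
        proposal = x + step * rng.normal(size=x.size)        # symmetric Gaussian proposal
        log_q_prop = unnormalised_log_q(proposal[None, :])[0]
        if np.log(rng.uniform()) < log_q_prop - log_q_x:     # accept/reject
            x, log_q_x = proposal, log_q_prop
        samples[i] = x
    return samples

# e.g. draws = rw_metropolis_hastings(unnormalised_log_q, x0=np.zeros(2))
```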
2) which we use to update Qt 1. In most experiments we train for T > 1 rounds therefore we need to sample from Qt 1.4 Our setting here is simple and so this is easily accomplished using random walk Metropolis Hastings. As noted in the introduction, in more sophisticated domains it remains an open question how to sample effectively from a density of the form (4), in particular for a support having large dimensionality. Since our classifiers ct are the outputs of a neural network they are unbounded, this violates the assumptions of 3, therefore in most cases we use the naive choice t Metrics At each t 2 {0, . . . , T} we estimate compute KL(P, Qt), (normalised) Negative Log-Likelihood (NLL) 1 EP log p EP log qt, and accuracy EP Jct > 1 2K. Note that we normalise NLL by its true value to make this quantity more interperable. The KL divergence is computed using numerical integration, and as such it can be quite tricky to ensure stability when running stochastically varying experiments, and becomes very hard to compute in dimensions higher than n = 2. In these computationally difficult cases we use NLL, which is much more stable by comparison. We plot the mean and 95% confidence intervals for these quantities. 4.1. Results Complete details about the experimental procedures including target data and network architectures are deferred to the supplementary material ( E). 4.1.1. ERROR AND CONVERGENCE (0.5 if Qt ! P ) 0 1 2 3 4 5 0 (lower is better) Figure 6: As KL(P, Qt) ! 0 it becomes harder and harder to train a good classifier ct. 4It is easy to pick Q0 to be convenient to sample from. In order to minimise the divergence of Qt from P we need to train a sufficiently good classifier ct such that we can build a good approximation to d P/d Qt 1. Naturally as Qt ! P it should become harder and harder to tell the difference between a sample from P and Qt with high probability. This is exactly what we observe. In Figure 6 we train a classifier with the same neural network topology as in 4.1.2. The test accuracy over the course of training before each t is plotted. As KL(P, Qt) ! 0 samples from P and Qt become harder and harder to tell apart and the best accuracy we can achieve over the course of training decreases, approaching 1/2. Dually, the higher the training accuracy achieved by ct, the greater the reduction from KL(P, Qt 1) to KL(P, Qt), thus the decreasing saw-tooth shape in Figure 6 is characteristic of convergence. 4.1.2. ACTIVATION FUNCTIONS To look at the effect of the choice of activation function we train the same network topology, for a set of activation functions: Numerical results trained to fit a ring of Gaussians are plotted in Figure 8a, contour plots of some of the resulting densities are presented Figure 7. All activation functions except for Softplus performed about the same by the end of the sixth round, with Re LU and SELU being the marginal winners. It is also interesting to note the narrow error ribbons on tanh compared to the other functions, indicating more consistent training. (a) P (b) Re LU (c) tanh (d) Softplus Figure 7: The effect of different activation functions, modelling a ring of Gaussians. The petals in the Re LU condition are likely due to the linear hyperplane sections the final network layer being shaped by the final exponential layer. Boosted Density Estimation Remastered 4.1.3. 
4.1.3. Network topology

To compare the effect of the choice of network architecture we fix the activation function and try a variety of architectures, varying both the depth and the number of nodes per layer. For this experiment the target distribution P is a mixture of 8 Gaussians that are randomly positioned at the beginning of each run of training. Let m × n denote a fully connected neural network $c_t$ with m hidden layers and n nodes per layer; after each hidden layer we apply the SELU activation function. Numerical results are plotted in Figure 8b. Interestingly, doubling the nodes per layer has little benefit, showing only a moderate advantage. By comparison, increasing the network depth allows us to achieve over a 70% reduction in the minimal divergence we are able to reach.

Figure 8: KL divergence (lower is better) over six iterations of boosting for (a) a variety of activation functions (ReLU, SELU, Softplus, Sigmoid, tanh) and (b) a variety of architectures (SELU: 1 × 5, 2 × 5, 1 × 10, 2 × 10, 2 × 20).

4.1.4. Convergence across dimensions

For this experiment we vary the dimension $n \in \{2, 4, 6\}$ of the space $X = \mathbb{R}^n$, using a neural classifier $c_t$ that is trained without regard for overfitting, and look at the convergence of the NLL (Figure 9). After we achieve the optimal NLL of 1, we observe that the NLL becomes quite variable as we begin to overfit. Secondly, overfitting the likelihood becomes harder as we increase the dimensionality, taking roughly twice the number of iterations to pass NLL = 1 in the n = 4 condition as in the n = 2 condition. We conjecture that avoiding overfitting is a matter of stopping the boosting early, in a similar way to how early stopping was proven to yield consistency of boosting algorithms (Bartlett & Traskin, 2006).

Figure 9: Convergence in more dimensions (Negative Log-Likelihood; closer to 1 is better).

4.1.5. Comparison with Tolstikhin et al. (2017)

To compare the performance of our model (here called DISCRIM) with ADAGAN we replicate their Gaussian mixture toy experiment (the experiment gaussian_gmm.py at github.com/tolstikhin/adagan), fitting a randomly located eight-component isotropic Gaussian mixture where each component has constant variance. These are sampled using the code provided by Tolstikhin et al. (2017). We compute the coverage metric of Tolstikhin et al. (2017),

    $C_\rho(P, Q) := P(\mathrm{lev}_{>\beta}\, q)$, where $\beta$ is such that $Q(\mathrm{lev}_{>\beta}\, q) = \rho$, and $\rho \in [0, 1]$.

That is, we first find $\beta$ to determine a set where most of the mass of Q lies, $\mathrm{lev}_{>\beta}\, q$, and then look at how much of the mass of P resides there. (The coverage metric $C_\rho$ can be a bit misleading, since any density Q that covers P will yield a high $C_\rho(P, Q)$, no matter how spread out it is; this is the case at t = 0 when we initially fit $Q_0$. A high coverage metric is, however, sufficient to claim that a model Q has not ignored any of the mass of P when combined with another metric such as the NLL. That is, a high $C_\rho$ is a necessary condition for mode capture.)

Results from the experiment are plotted in Figure 10. Both DISCRIM and ADAGAN converge closely to the true NLL, and we then observe the same characteristic overfitting as in previous experiments after iteration 4 (Figure 10a). It is also interesting that this reveals itself in a degradation of the coverage metric (Figure 10b). Notably, ADAGAN converges tightly, with NLL centred around its mean, while DISCRIM begins to vary wildly. However, the AdaGAN procedure includes a step size that decreases with 1/t, thereby preventing overfitting, whereas DISCRIM uses a constant step size of 1/2. This suggests that a similarly decreasing procedure for $\alpha_t$ may have desirable properties.

Figure 10: Comparing the performance of DISCRIM and ADAGAN: (a) Negative Log-Likelihood (closer to 1 is better); (b) coverage (higher is better).
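The coverage metric used above can be estimated directly from samples: $\beta$ is the $(1 - \rho)$-quantile of $q(X)$ under $X \sim Q$, after which $C_\rho(P, Q)$ is just the fraction of P-samples landing in the level set. A small Python sketch with hypothetical Gaussian P and Q (not the mixture used in the experiment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def coverage(q_pdf, q_samples, p_samples, rho=0.95):
    """C_rho(P, Q) = P(lev_{>beta} q), with beta chosen so that Q(lev_{>beta} q) = rho.
    beta is the (1 - rho)-quantile of q(X) under X ~ Q, estimated from q_samples."""
    beta = np.quantile(q_pdf(q_samples), 1.0 - rho)
    return np.mean(q_pdf(p_samples) > beta)

# Hypothetical example: Q covers P (same mean, wider spread) -> coverage near 1.
p = stats.multivariate_normal(mean=[0, 0], cov=np.eye(2))
q = stats.multivariate_normal(mean=[0, 0], cov=4 * np.eye(2))
print(coverage(q.pdf, q.rvs(100_000, random_state=rng), p.rvs(100_000, random_state=rng)))
```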
4.1.6. Comparison with KDE

In this experiment we compare our boosted densities with Kernel Density Estimation (KDE). We train a deep neural network with three hidden layers. The step size $\alpha_t$ is selected to minimise the NLL by evaluating the training set at 10 equally spaced points over [0, 1]. We compare the resultant density after T = 2 rounds with a variety of kernel density estimators, with bandwidth selected via the Scott/Silverman rule (the Scott and Silverman rules yield identical bandwidth selection criteria in the two-dimensional case). Results from this experiment are displayed in Figure 11 (in the supplementary material). On average $Q_1$ fits the target distribution P better than all but the most efficient kernels, and at $Q_2$ we begin overfitting, which aligns with the observations made in §4.1.4. We note that this performance is achieved with a model of around 200 parameters, while the kernel estimators each have 2000, i.e. we achieve KDE's performance with models whose size is a tenth of KDE's. Also, in this experiment $\alpha_t$ is selected to minimise the NLL; however, it is not hard to imagine that a different selection criterion for $\alpha_t$ would yield better properties with respect to overfitting.

4.2. Summary

We summarise here some key experimental observations:

- Both the activation functions and the network topology have a large effect on the ease of training and the quality of the learned density $Q_T$, with deeper networks with fewer nodes per layer yielding the best results (§4.1.2, §4.1.3).
- When the networks $c_t$ are trained long enough, we observe overfitting in the resulting densities $Q_T$ and instability in the training procedure after the point of overfitting (§4.1.4-§4.1.6), indicating that a procedure taking $\alpha_t \to 0$ should be optimal.
- We were able to match the performance of kernel density estimation with a naive procedure to select $\alpha_t$ (§4.1.6).
- We were able to at least match the performance of ADAGAN with respect to density estimation (§4.1.5).

Finally, while we have used KDE as a point of comparison for our algorithm, there is no reason why the two techniques could not be combined. Since a KDE is a closed-form mixture distribution that is quite easy to sample, one could build some kind of kernel density estimate and use it for $Q_0$, which one could then refine with a neural network.
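As a rough illustration of this last remark, a kernel density estimate already provides the two things the update (4) needs from $Q_0$: a log-density and cheap exact samples. The sketch below uses scipy's gaussian_kde on a hypothetical sample and is only meant to indicate the plumbing, not a tuned procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical training sample from P; a KDE of it is cheap to evaluate and to sample.
data = rng.normal(size=(2, 1_000))                # gaussian_kde expects shape (d, n)
q0_kde = stats.gaussian_kde(data)                 # bandwidth via Scott's rule by default

log_q0 = lambda x: q0_kde.logpdf(x.T)             # evaluate log q_0 at rows of x (n, d)
x0_samples = q0_kde.resample(10_000, seed=3).T    # easy exact sampling from Q_0

# These two pieces are exactly what the boosting update (4) needs from Q_0:
# log q_0 enters the boosted log-density, and samples from Q_0 provide the
# "negative" class when training the first classifier c_1.
```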
5. Conclusion

The prospect of learning a density iteratively with a boosting-like procedure has recently been met with significant attention. However, the success of these approaches hinges on the existence of oracles satisfying very strong assumptions. By contrast, we have shown that a weak learner in the original sense of Kearns (1988), reinterpreted in terms of learning a density ratio, is sufficient to yield comparable or better convergence bounds than previous approaches. To derive this result we leverage a series of related contributions, including 1) a finer characterisation of f-GAN (Nowozin et al., 2016), and 2) a full characterisation of the densities we learn, in terms of an exponential family. Experimentally, our approach shows promising results with respect to the capture of modes, and significantly outperforms AdaGAN during the early boosting iterations using a comparatively very small architecture. Our experiments leave open the challenge of obtaining a black-box sampler for domains with moderate to large dimension. We conjecture that the exponential family characterisation should be of significant help in tackling this challenge.

Acknowledgements

We wish to thank Ilya Tolstikhin, Olivier Bousquet, and the reviewers for many helpful suggestions and remarks. The boosted density implementation is available at https://github.com/ZacCranko/BoostedDensities.jl.

References

Ali, S. M. and Silvey, S. D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131-142, 1966.

Amari, S.-I. and Nagaoka, H. Methods of Information Geometry. Oxford University Press, 2000.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Banerjee, A., Merugu, S., Dhillon, I., and Ghosh, J. Clustering with Bregman divergences. JMLR, 6:1705-1749, 2005.

Barndorff-Nielsen, O. Information and Exponential Families in Statistical Theory. Wiley Publishers, 1978.

Bartlett, P. and Traskin, M. AdaBoost is consistent. In NIPS*19, 2006.

Csiszár, I. Information type measures of difference of probability distributions. Studia Scientiarum Mathematicarum Hungarica, 17:123-149, 1967.

Dai, B., He, N., Dai, H., and Song, L. Provable Bayesian inference via particle mirror descent. In Artificial Intelligence and Statistics, pp. 985-994, 2016.

Dudík, M., Phillips, S.-J., and Schapire, R.-E. Performance guarantees for regularized maximum entropy density estimation. In COLT, 2004.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55:119-139, 1997.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Grover, A. and Ermon, S. Boosted generative models. In AAAI, 2018.

Guo, F., Wang, X., Fan, K., Broderick, T., and Dunson, D.-B. Boosting variational inference. CoRR, abs/1611.05559, 2016.

Innes, M. Flux: Elegant machine learning with Julia. Journal of Open Source Software, 2018. doi: 10.21105/joss.00602.

Kearns, M. Thoughts on hypothesis boosting. Unpublished manuscript, 45:105, 1988.

Khan, M.-E., Babanezhad, R., Lin, W., Schmidt, M.-W., and Sugiyama, M. Faster stochastic variational inference using proximal-gradient methods with general divergence functions. In UAI, 2016.

Li, J. Q. and Barron, A. R. Mixture density estimation. In Advances in Neural Information Processing Systems, pp. 279-285, 2000.

Locatello, F., Khanna, R., Ghosh, J., and Rätsch, G. Boosting variational inference: an optimization perspective. CoRR, abs/1708.01733, 2017.

McDiarmid, C. Concentration. In Habib, M., McDiarmid, C., Ramirez-Alfonsin, J., and Reed, B. (eds.), Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 1-54. Springer Verlag, 1998.

Miller, A.-C., Foti, N.-J., and Adams, R.-P.
Variational boosting: Iteratively refining posterior approximations. In ICML, pp. 2420-2429, 2017.

Naito, K. and Eguchi, S. Density estimation with minimization of U-divergence. Machine Learning, 90(1):29-57, 2013.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.

Nock, R. and Nielsen, F. A real generalization of discrete AdaBoost. Artificial Intelligence, 171:25-41, 2007.

Nock, R. and Nielsen, F. On the efficient minimization of classification-calibrated surrogates. In Advances in Neural Information Processing Systems, pp. 1201-1208, 2008.

Nock, R., Cranko, Z., Menon, A. K., Qu, L., and Williamson, R. C. f-GANs in an information geometric nutshell. In Advances in Neural Information Processing Systems, pp. 456-464, 2017.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271-279, 2016.

Penot, J.-P. Calculus Without Derivatives, volume 266. Springer Science & Business Media, 2012.

Reid, M. D. and Williamson, R. C. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12(Mar):731-817, 2011.

Rosset, S. and Segal, E. Boosting density estimation. In NIPS, pp. 641-648, 2002.

Schapire, R. E. The strength of weak learnability. Machine Learning, pp. 197-227, 1990.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297-336, 1999.

Simić, S. On a new converse of Jensen's inequality. Publications de l'Institut Mathématique, 85:107-110, 2009a.

Simić, S. On an upper bound for Jensen's inequality. Journal of Inequalities in Pure and Applied Mathematics, 10, 2009b.

Tolstikhin, I.-O., Gelly, S., Bousquet, O., Simon-Gabriel, C., and Schölkopf, B. AdaGAN: Boosting generative models. In NIPS, pp. 5430-5439, 2017.

Zhang, T. Sequential greedy approximation for certain convex optimization problems. IEEE Trans. IT, 49:682-691, 2003.