# AdaGAN: Boosting Generative Models

Ilya Tolstikhin (MPI for Intelligent Systems, Tübingen, Germany; ilya@tue.mpg.de), Sylvain Gelly (Google Brain, Zürich, Switzerland; sylvaingelly@google.com), Olivier Bousquet (Google Brain, Zürich, Switzerland; obousquet@google.com), Carl-Johann Simon-Gabriel (MPI for Intelligent Systems, Tübingen, Germany; cjsimon@tue.mpg.de), Bernhard Schölkopf (MPI for Intelligent Systems, Tübingen, Germany; bs@tue.mpg.de)

*31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.*

**Abstract** Generative Adversarial Networks (GAN) are an effective method for training generative models of complex data such as natural images. However, they are notoriously hard to train and can suffer from the problem of missing modes, where the model is not able to produce examples in certain regions of the space. We propose an iterative procedure, called AdaGAN, where at every step we add a new component into a mixture model by running a GAN algorithm on a re-weighted sample. This is inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. We prove analytically that such an incremental procedure leads to convergence to the true distribution in a finite number of steps if each step is optimal, and convergence at an exponential rate otherwise. We also illustrate experimentally that this procedure addresses the problem of missing modes.

## 1 Introduction

Imagine we have a large corpus containing unlabeled pictures of animals, and our task is to build a generative probabilistic model of the data. We run a recently proposed algorithm and end up with a model which produces impressive pictures of cats and dogs, but not a single giraffe. A natural way to fix this would be to manually remove all cats and dogs from the training set and run the algorithm on the updated corpus. The algorithm would then have no choice but to produce new animals and, by iterating this process until there are only giraffes left in the training set, we would arrive at a model generating giraffes (assuming sufficient sample size). Finally, we would aggregate the models obtained by building a mixture model.

Unfortunately, the described meta-algorithm requires manual work for removing certain pictures from the unlabeled training set at every iteration. Let us turn this into an automatic approach: rather than including or excluding each picture, we put continuous weights on them. To this end, we train a binary classifier to separate "true" pictures of the original corpus from the synthetic pictures generated by the mixture of all the models trained so far. We would expect the classifier to make confident predictions for the true pictures of animals missed by the model (giraffes), because there are no synthetic pictures nearby to be confused with them. By a similar argument, the classifier should make less confident predictions for the true pictures containing animals already generated by one of the trained models (cats and dogs). For each picture in the corpus, we can thus use the classifier's confidence to compute a weight which we use for that picture in the next iteration, to be performed on the re-weighted dataset.
The present work provides a principled way to perform this re-weighting, with theoretical guarantees showing that the resulting mixture models indeed approach the true data distribution. (Note that the term "mixture" should not be interpreted to imply that each component models only one mode: the models to be combined into a mixture can themselves cover multiple modes.)

**Algorithm 1** AdaGAN, a meta-algorithm to construct a "strong" mixture of $T$ individual generative models (e.g., GANs), trained sequentially.

- **Input:** training sample $S_N := \{X_1, \ldots, X_N\}$.
- **Output:** mixture generative model $G = G_T$.

1. Train a vanilla GAN $G_1 = \mathrm{GAN}(S_N, W_1)$ with a uniform weight vector $W_1 = (1/N, \ldots, 1/N)$ over the training points.
2. For $t = 2, \ldots, T$:
   1. Choose the overall weight of the next mixture component: $\beta_t = \mathrm{ChooseMixtureWeight}(t)$.
   2. Update the weight of each training example: $W_t = \mathrm{UpdateTrainingWeights}(G_{t-1}, S_N, \beta_t)$.
   3. Train the $t$-th "weak" component generator: $G^c_t = \mathrm{GAN}(S_N, W_t)$.
   4. Update the overall generative model by forming a mixture of $G_{t-1}$ and $G^c_t$: $G_t = (1 - \beta_t)\, G_{t-1} + \beta_t\, G^c_t$.

Before discussing how to build the mixture, let us consider the question of building a single generative model. A recent trend in modelling high-dimensional data such as natural images is to use neural networks [1, 2]. One popular approach is Generative Adversarial Networks (GAN) [2], where the generator is trained adversarially against a classifier which tries to differentiate the true from the generated data. While the original GAN algorithm often produces realistic-looking data, several issues have been reported in the literature, among which the missing modes problem, where the generator converges to only one or a few modes of the data distribution, thus not providing enough variability in the generated data. This seems to match the situation described earlier, which is why we will most often illustrate our algorithm with a GAN as the underlying base generator. We call it AdaGAN, for Adaptive GAN, but we could actually use any other generator: a Gaussian mixture model, a VAE [1], a WGAN [3], or even an unrolled [4] or mode-regularized GAN [5], both of which were specifically developed to tackle the missing modes problem. Thus, we do not aim at improving the original GAN or any other generative algorithm; we rather propose and analyse a meta-algorithm that can be used on top of any of them. This meta-algorithm is similar in spirit to AdaBoost, in the sense that each iteration corresponds to learning a "weak" generative model (e.g., a GAN) with respect to a re-weighted data distribution. The weights change over time to focus on the "hard" examples, i.e. those that the mixture has not been able to properly generate so far.
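To make the control flow of Algorithm 1 concrete, here is a minimal sketch of the meta-loop in Python. It is an illustrative sketch, not the released implementation: `train_gan` and `update_training_weights` are hypothetical callables standing in for the GAN training step and for the re-weighting described in Section 3.

```python
import numpy as np

# Sketch of Algorithm 1 (AdaGAN). `train_gan(data, weights)` is assumed to
# return a generator object with a .sample(n) method; `update_training_weights`
# implements the re-weighting of Section 3. Both are placeholders.

def adagan(data, T, train_gan, update_training_weights):
    N = len(data)
    # t = 1: vanilla GAN on uniformly weighted training points.
    generators = [train_gan(data, np.full(N, 1.0 / N))]
    alphas = [1.0]                        # mixture weights of the components

    for t in range(2, T + 1):
        beta_t = 1.0 / t                  # ChooseMixtureWeight heuristic
        w_t = update_training_weights(generators, alphas, data, beta_t)
        generators.append(train_gan(data, w_t))       # weak component G_t^c
        alphas = [(1.0 - beta_t) * a for a in alphas] + [beta_t]

    def sample(n):
        # Two-step sampling from the mixture G_T: pick components, then sample.
        counts = np.random.multinomial(n, alphas)
        return np.concatenate([g.sample(k)
                               for g, k in zip(generators, counts) if k > 0])

    return sample
```

Note that the mixture is never materialized as a density: it is represented only by the list of trained generators and their weights, and generation follows the two-step sampling described in Section 2.1 below.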
**Related Work** Several authors [6, 7, 8] have proposed to use boosting techniques in the context of density estimation by incrementally adding components in the log domain; this idea was applied to GANs in [8]. A major downside of these approaches is that the resulting mixture is a *product* of components: sampling from such a model is nontrivial (at least when applied to GANs, where the model density is not expressed analytically) and requires techniques such as Annealed Importance Sampling [9] for the normalization. For the case where the log-likelihood can be computed, [10] proposed to use an additive mixture model. They derived the update rule by computing the steepest descent direction when adding a component with infinitesimal weight. However, their results do not apply once the weight $\beta$ becomes non-infinitesimal. In contrast, for any fixed weight of the new component, our approach gives the overall optimal update (rather than just the best direction) for a specified f-divergence. In both theories, improvements of the mixture are guaranteed only if the new "weak" learner is still good enough (see Conditions (10) and (11)).

Similarly, [11] studied the construction of mixtures minimizing the Kullback–Leibler divergence and proposed a greedy procedure for doing so. They also proved that, under certain conditions, finite mixtures can approximate arbitrary mixtures at rate $1/k$, where $k$ is the number of components in the mixture, when the weight of each newly added component is $1/k$. These results are specific to the Kullback–Leibler divergence, but they are consistent with our more general results. An additive procedure similar to ours was proposed in [12], but with a different re-weighting scheme which is not motivated by a theoretical analysis of optimality conditions: on every new iteration, the authors run a GAN on the $k$ training examples with maximal values of the discriminator from the last iteration.

Finally, many papers investigate completely different approaches to the same issue, directly modifying the training objective of an individual GAN. For instance, [5] add an autoencoding cost to the training objective of GAN, while [4] allow the generator to look a few steps ahead when making a gradient step.

The paper is organized as follows. In Section 2 we present our main theoretical results regarding the iterative optimization of mixture models under general f-divergences. In Section 2.4 we show that if the optimization at each step is perfect, the process converges to the true data distribution at an exponential rate (or even in a finite number of steps, for which we provide a necessary and sufficient condition). In Section 2.5 we then show that imperfect solutions still lead to an exponential rate of convergence under certain "weak learnability" conditions. These results naturally lead to a new boosting-style iterative procedure for constructing generative models. When used with GANs, it results in our AdaGAN algorithm, detailed in Section 3. Finally, we report initial empirical results in Section 4, where we compare AdaGAN with several benchmarks, including the original GAN algorithm and a uniform mixture of multiple independently trained GANs. Some of the theoretical results are stated without proofs, which can be found in the appendices.

## 2 Minimizing f-divergence with Mixtures

### 2.1 Preliminaries and notations

**Generative Density Estimation** In density estimation, one tries to approximate a real data distribution $P_d$, defined over the data space $\mathcal{X}$, by a model distribution $P_{model}$. In the generative approach, one builds a function $G: \mathcal{Z} \to \mathcal{X}$ that transforms a fixed probability distribution $P_Z$ (often called the noise distribution) over a latent space $\mathcal{Z}$ into a distribution over $\mathcal{X}$. Hence $P_{model}$ is the pushforward of $P_Z$, i.e. $P_{model}(A) = P_Z(G^{-1}(A))$. With this approach it is in general impossible to compute the density $dP_{model}(x)$, and hence the log-likelihood of the training data under the model, but one can easily sample from $P_{model}$ by sampling from $P_Z$ and applying $G$ (see the sketch below). Thus, to construct $G$, instead of comparing $P_{model}$ directly with $P_d$, one compares their samples. To do so, one uses a similarity measure $D(P_{model} \,\|\, P_d)$ which can be estimated from samples of those distributions, and thus approximately minimized over a class $\mathcal{G}$ of functions.
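As a minimal illustration of this pushforward construction (the latent dimension and the particular map $G$ below are arbitrary choices for the sketch):

```python
import numpy as np

# P_model is the pushforward of the noise distribution P_Z under G:
# we cannot evaluate its density, but sampling from it is trivial.

def G(z):
    # Illustrative deterministic generator mapping latent R^2 to data R^2.
    return np.tanh(z @ np.array([[2.0, 0.5], [-0.5, 1.0]]))

def sample_model(n, latent_dim=2):
    z = np.random.normal(size=(n, latent_dim))   # Z ~ P_Z (standard Gaussian)
    return G(z)                                  # X = G(Z) ~ P_model

x_fake = sample_model(1000)                      # a "fake" sample for training
```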
**f-Divergences** In order to measure the agreement between the model distribution and the true distribution, we use f-divergences, defined as
$$D_f(Q \,\|\, P) := \int f\!\left(\frac{dQ}{dP}(x)\right) dP(x) \tag{1}$$
for any pair of distributions $P, Q$ with densities $dP, dQ$ with respect to some dominating reference measure $\mu$ (we refer to Appendix D for more details about such divergences and their domain of definition). Here we assume that $f$ is convex, defined on $(0, \infty)$, and satisfies $f(1) = 0$; we denote by $\mathcal{F}$ the set of such functions. Examples of f-divergences include the Kullback–Leibler divergence, obtained for $f(x) = x\log x$, and the Jensen–Shannon divergence, obtained for $f(x) = -(x+1)\log\frac{x+1}{2} + x\log x$; other examples can be found in [13], and for further details we refer to Section 1.3 of [14] and to [15].

As demonstrated in [16, 17], several commonly used symmetric f-divergences are Hilbertian metrics, which in particular means that their square root satisfies the triangle inequality. This is true for the Jensen–Shannon divergence, the Hellinger distance, and the total variation, among others; in the Jensen–Shannon case, this property can thus be used in the context of the original GAN algorithm. We denote by $\mathcal{F}_H$ the set of functions $f$ such that $D_f$ is a Hilbertian metric.

**GAN and f-divergences** The original GAN algorithm [2] optimizes the criterion
$$\min_G \max_D \; \mathbb{E}_{P_d}[\log D(X)] + \mathbb{E}_{P_Z}[\log(1 - D(G(Z)))], \tag{2}$$
where $D$ and $G$ are two functions represented by neural networks. This optimization is performed on a pair of samples (a training sample from $P_d$ and a "fake" sample from $P_Z$), which corresponds to approximating the above criterion by using the empirical distributions. In the non-parametric limit for $D$, this is equivalent to minimizing the Jensen–Shannon divergence [2], and this point of view can be generalized to any other f-divergence [13]. Because of this strong connection between adversarial training of generative models and the minimization of f-divergences, we cast the results of this section in the context of general f-divergences.

**Generative Mixture Models** In order to model complex data distributions, it can be convenient to use a mixture model of the form
$$P_{model}^T := \sum_{i=1}^{T} \alpha_i P_i, \qquad \alpha_i \ge 0, \quad \sum_i \alpha_i = 1,$$
where each of the $T$ components is a generative density model. This is natural in the generative context, since sampling from a mixture corresponds to two-step sampling: first pick a mixture component (according to the multinomial distribution with parameters $\alpha_i$), then sample from it. It also allows the construction of complex models from simpler ones.
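Stepping back to Definition (1), the following small numerical sketch evaluates $D_f(Q\|P)$ for two discrete distributions, using the KL and Jensen–Shannon generators quoted above (the support and the distributions are arbitrary illustrative choices):

```python
import numpy as np

# D_f(Q || P) = sum_x f(dQ/dP) dP for discrete P, Q with full support,
# with the generators f(x) = x log x (KL) and
# f(x) = -(x + 1) log((x + 1) / 2) + x log x (Jensen-Shannon).

def f_divergence(f, q, p):
    r = q / p                       # density ratio dQ/dP (assumes p > 0)
    return float(np.sum(f(r) * p))

f_kl = lambda x: x * np.log(x)
f_js = lambda x: -(x + 1) * np.log((x + 1) / 2) + x * np.log(x)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

print(f_divergence(f_kl, q, p))     # KL(Q || P): positive, asymmetric
print(f_divergence(f_js, q, p))     # JS: the next line prints the same value,
print(f_divergence(f_js, p, q))     # illustrating the symmetry noted above
```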
### 2.2 Incremental Mixture Building

We restrict ourselves to the case of f-divergences and assume that, given an i.i.d. sample from any unknown distribution $P$, we can construct a simple model $Q \in \mathcal{G}$ which approximately minimizes
$$\min_{Q \in \mathcal{G}} D_f(Q \,\|\, P) \tag{3}$$
(one example of such a setting is running GANs). Instead of modelling the data with a single distribution, we now want to model it with a mixture of distributions $P_i$, where each $P_i$ is obtained by a training procedure of the form (3) with (possibly) different target distributions $P$ for each $i$. A natural way to build a mixture is to do it incrementally: we train the first model $P_1$ to minimize $D_f(P_1 \| P_d)$ and set the corresponding weight to $\alpha_1 = 1$, leading to $P^1_{model} = P_1$. Then, after having trained $t$ components $P_1, \ldots, P_t \in \mathcal{G}$, we form the $(t+1)$-st mixture model by adding a new component $Q$ with weight $\beta$:
$$P^{t+1}_{model} := \sum_{i=1}^{t} (1-\beta)\,\alpha_i P_i + \beta Q, \tag{4}$$
where $\beta \in [0,1]$ and $Q \in \mathcal{G}$ is computed by minimizing
$$\min_{Q} D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big), \tag{5}$$
where we denote by $P_g := P^t_{model}$ the current generative mixture model before adding the new component.

We do not expect to find the optimal $Q$ that minimizes (5) at each step; rather, we aim at constructing some $Q$ that slightly improves our current approximation of $P_d$, i.e. such that, for some $c < 1$,
$$D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big) \le c \cdot D_f(P_g \,\|\, P_d). \tag{6}$$

This greedy approach has a significant drawback in practice. As we build up the mixture, we need to make $\beta$ decrease (as $P^t_{model}$ approximates $P_d$ better and better, the correction at each step should become smaller and smaller). Since we approximate (5) using samples from both distributions, the sample from the mixture will contain only a fraction $\beta$ of examples from $Q$. So, as $t$ increases, getting meaningful information from such a sample so as to tune $Q$ becomes harder and harder (the information is "diluted"). To address this issue, we propose to optimize an upper bound on (5) which involves a term of the form $D_f(Q \,\|\, R)$ for some distribution $R$ that can be computed as a re-weighting of the original data distribution $P_d$. This procedure is reminiscent of the AdaBoost algorithm [18], which combines multiple weak predictors into one strong composition. On each step, AdaBoost adds a new predictor to the current composition, trained to minimize the binary loss on a re-weighted training set; the weights are constantly updated to bias the next weak learner towards the "hard" examples, those that were incorrectly classified during previous stages.

In the following we analyze the properties of (5) and derive upper bounds that provide practical optimization criteria for building the mixture. We also show that, under certain assumptions, minimizing the upper bound leads to the optimum of the original criterion.

### 2.3 Upper Bounds

We provide two upper bounds on the divergence of the mixture in terms of the divergence of the additive component $Q$ with respect to some reference distribution $R$.

**Lemma 1** Given two distributions $P_d$, $P_g$ and some $\beta \in [0,1]$, for any $Q$, any $R$, and any $f \in \mathcal{F}_H$:
$$\sqrt{D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big)} \;\le\; \sqrt{\beta\, D_f(Q \,\|\, R)} + \sqrt{D_f\big((1-\beta)P_g + \beta R \,\|\, P_d\big)}. \tag{7}$$
If, more generally, $f \in \mathcal{F}$ but $\beta\, dR \le dP_d$, then
$$D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big) \;\le\; \beta\, D_f(Q \,\|\, R) + (1-\beta)\, D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta R}{1-\beta}\right). \tag{8}$$

We can thus exploit these bounds by introducing a well-chosen distribution $R$ and then minimizing them with respect to $Q$. A natural choice for $R$ is a distribution that minimizes the last term of the upper bound (which does not depend on $Q$). Our main result indicates the shape of the distributions minimizing the right-most terms in those bounds.

**Theorem 1** For any f-divergence $D_f$ with differentiable $f \in \mathcal{F}$, any fixed distributions $P_d$, $P_g$, and any $\beta \in (0,1]$, the minimizer of (5) over all probability distributions has density
$$dQ^*_\beta(x) = \frac{1}{\beta}\big(\lambda^*\, dP_d(x) - (1-\beta)\, dP_g(x)\big)_+ \tag{9}$$
for the unique $\lambda^* \in [\beta, 1]$ satisfying $\int dQ^*_\beta = 1$. Moreover, $\lambda^* = 1$ if and only if $P_d\big((1-\beta)\,dP_g > dP_d\big) = 0$, which is equivalent to $\beta\, dQ^*_\beta = dP_d - (1-\beta)\, dP_g$.

**Theorem 2** Given two distributions $P_d$, $P_g$ and some $\beta \in (0,1]$, assume $P_d(dP_g = 0) < \beta$ and let $f \in \mathcal{F}$. Then the problem
$$\min_{Q:\, \beta\, dQ \le dP_d} D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big)$$
has a solution with density
$$dQ^\dagger_\beta(x) = \frac{1}{\beta}\big(dP_d(x) - \lambda^\dagger (1-\beta)\, dP_g(x)\big)_+$$
for the unique $\lambda^\dagger \ge 1$ that satisfies $\int dQ^\dagger_\beta = 1$.

Surprisingly, in both Theorems 1 and 2 the solutions do not depend on the choice of the function $f$, which means that the solution is the same for any f-divergence. In particular, replacing $f$ with $f^\circ(x) := x f(1/x)$ yields the same solution for the criterion written in the other direction, so the order in which we write the divergence does not matter and the optimal solution is optimal for both orders. Note that $\lambda^*$ is implicitly defined by a fixed-point equation; in Section 3 we show how it can be computed efficiently in the case of empirical distributions.
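To build intuition for Theorem 1, the following sketch computes $dQ^*_\beta$ on a grid for a 1-D example in which $P_d$ has two modes and $P_g$ covers only one of them. Here $\lambda^*$ is found by bisection over $[\beta, 1]$, using the fact that the total mass of the right-hand side of (9) is nondecreasing in $\lambda$ (the Gaussians and the value of $\beta$ are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

# Optimal target density dQ*_beta of Theorem 1 on a 1-D grid.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p_d = 0.5 * norm.pdf(x, -3.0) + 0.5 * norm.pdf(x, 3.0)  # true data: two modes
p_g = norm.pdf(x, -3.0)                                 # model: one mode only
beta = 0.3

def q_star(lam):
    # (lambda * dP_d - (1 - beta) * dP_g)_+ / beta, cf. equation (9)
    return np.maximum(lam * p_d - (1.0 - beta) * p_g, 0.0) / beta

lo, hi = beta, 1.0                 # lambda* lies in [beta, 1] by Theorem 1
for _ in range(60):                # bisection on the mass of q_star
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if q_star(mid).sum() * dx < 1.0 else (lo, mid)

q = q_star(hi)
print(q.sum() * dx)                # ~1: a valid probability density
print(q[x > 0].sum() * dx)         # most of the mass sits on the missed mode
```

As expected, almost all of the optimal component's mass lands on the mode that $P_g$ misses, which is exactly the behaviour the re-weighting of Section 3 tries to emulate from samples.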
### 2.4 Convergence Analysis for Optimal Updates

In the previous section we derived analytical expressions for the distributions $R$ minimizing the last terms of the upper bounds (7) and (8). Assuming $Q$ can perfectly match $R$, i.e. $D_f(Q\|R) = 0$, we are now interested in the convergence of the mixture (4) to the true data distribution $P_d$ when $Q = Q^*_\beta$ or $Q = Q^\dagger_\beta$. We start with simple results showing that adding $Q^*_\beta$ or $Q^\dagger_\beta$ to the current mixture yields a strict improvement of the divergence.

**Lemma 2** Under the conditions of Theorem 1, we have
$$D_f\big((1-\beta)P_g + \beta Q^*_\beta \,\|\, P_d\big) \;\le\; D_f\big((1-\beta)P_g + \beta P_d \,\|\, P_d\big) \;\le\; (1-\beta)\, D_f(P_g \,\|\, P_d).$$
Under the conditions of Theorem 2, we have
$$D_f\!\left(P_g \,\Big\|\, \frac{P_d - \beta Q^\dagger_\beta}{1-\beta}\right) \le D_f(P_g \,\|\, P_d) \quad\text{and}\quad D_f\big((1-\beta)P_g + \beta Q^\dagger_\beta \,\|\, P_d\big) \le (1-\beta)\, D_f(P_g \,\|\, P_d).$$

Imagine repeatedly adding $T$ new components to the current mixture $P_g$, where on every step we use the same weight $\beta$ and choose the components described in Theorem 1. In this case Lemma 2 guarantees that the original objective value $D_f(P_g \| P_d)$ is reduced at least to $(1-\beta)^T D_f(P_g \| P_d)$. This exponential rate of convergence, which at first may look surprisingly good, is simply explained by the fact that $Q^*_\beta$ depends on the true distribution $P_d$, which is of course unknown. Lemma 2 also suggests setting $\beta$ as large as possible, since we assume we can compute the optimal mixture component (which for $\beta = 1$ is $P_d$ itself). In practice, however, we may prefer to keep $\beta$ relatively small, preserving what we have learned so far through $P_g$: for instance, when $P_g$ already covers part of the modes of $P_d$ and we want $Q$ to cover the remaining ones. We discuss the choice of $\beta$ further in Section 3.

### 2.5 Weak to Strong Learnability

In practice the component $Q$ that we add to the mixture is not exactly $Q^*_\beta$ or $Q^\dagger_\beta$, but rather an approximation to them. In this section we show that if this approximation is good enough, then we retain property (6) (exponential improvements). Looking again at Lemma 1, notice that the first upper bound is less tight than the second one. Indeed, take the optimal distributions provided by Theorems 1 and 2 and plug them back as $R$ into the upper bounds of Lemma 1, and assume that $Q$ matches $R$ exactly, i.e. $D_f(Q\|R) = 0$. In this case both sides of (7) are equal to $D_f((1-\beta)P_g + \beta Q^*_\beta \,\|\, P_d)$, which is the optimal value of the original objective (5). On the other hand, (8) does not become an equality and its right-hand side is not optimal for (5). However, we earlier settled for the more modest goal (6), and next we show that this goal is indeed achievable: Corollaries 1 and 2 provide sufficient conditions for strict improvements when we use the upper bounds (8) and (7), respectively.

**Corollary 1** Given $P_d$, $P_g$, and some $\beta \in (0,1]$, assume $P_d\big(dP_g/dP_d = 0\big) < \beta$, and let $Q^\dagger_\beta$ be as defined in Theorem 2. If $Q$ is such that
$$D_f(Q \,\|\, Q^\dagger_\beta) \le \gamma\, D_f(P_g \,\|\, P_d) \tag{10}$$
for some $\gamma \in [0,1]$, then
$$D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big) \le \big(1 - \beta(1-\gamma)\big)\, D_f(P_g \,\|\, P_d).$$

**Corollary 2** Let $f \in \mathcal{F}_H$. Take any $\beta \in (0,1]$, any $P_d$, $P_g$, and let $Q^*_\beta$ be as defined in Theorem 1. If $Q$ is such that
$$D_f(Q \,\|\, Q^*_\beta) \le \gamma\, D_f(P_g \,\|\, P_d) \tag{11}$$
for some $\gamma \in [0,1]$, then
$$D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big) \le C_{\gamma,\beta}\, D_f(P_g \,\|\, P_d), \qquad C_{\gamma,\beta} = \big(\sqrt{\gamma\beta} + \sqrt{1-\beta}\big)^2,$$
where $C_{\gamma,\beta}$ is strictly smaller than 1 as soon as $\gamma < \beta/4$ (and $\beta > 0$).
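The constant $C_{\gamma,\beta}$ in Corollary 2 can be traced directly to the Hilbertian triangle inequality. The following chain sketches the argument, combining (7) with $R = Q^*_\beta$, Condition (11), and Lemma 2:

```latex
\begin{align*}
\sqrt{D_f\big((1-\beta)P_g + \beta Q \,\|\, P_d\big)}
  &\le \sqrt{\beta\, D_f(Q \,\|\, Q^*_\beta)}
     + \sqrt{D_f\big((1-\beta)P_g + \beta Q^*_\beta \,\|\, P_d\big)}
     && \text{by (7)} \\
  &\le \sqrt{\beta \gamma\, D_f(P_g \,\|\, P_d)}
     + \sqrt{(1-\beta)\, D_f(P_g \,\|\, P_d)}
     && \text{by (11) and Lemma 2} \\
  &= \big(\sqrt{\gamma\beta} + \sqrt{1-\beta}\big)\,
     \sqrt{D_f(P_g \,\|\, P_d)}.
\end{align*}
```

Squaring both sides gives the factor $C_{\gamma,\beta} = (\sqrt{\gamma\beta} + \sqrt{1-\beta})^2$.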
Conditions (10) and (11) may be compared to the weak learnability condition of AdaBoost. As long as our weak learner is able to solve the surrogate problem (3) of matching $Q^\dagger_\beta$ or $Q^*_\beta$ accurately enough, the original objective (5) is guaranteed to decrease as well. It should be noted, however, that Condition (11) with $\gamma < \beta/4$ is perhaps too strong to be called "weak learnability". Indeed, as already mentioned, the weight $\beta$ is expected to decrease to zero as the number of components in the mixture distribution $P_g$ increases. This forces $\gamma \to 0$, making Condition (11) harder and harder to meet. (The hardness of meeting Condition (11) of course largely depends on the class of models $\mathcal{G}$ used to fit $Q$ in (3); for now we ignore this question and leave it for future research.) This obstacle may be partially resolved by the fact that we will use a GAN to fit $Q$, which corresponds to a relatively rich class of models $\mathcal{G}$ in (3). In other words, our weak learner is not so weak.

On the other hand, Condition (10) of Corollary 1 is milder: no matter what $\gamma \in [0,1]$ and $\beta \in (0,1]$ are, the new component $Q$ is guaranteed to strictly improve the objective functional. This comes at the price of the additional condition $P_d(dP_g/dP_d = 0) < \beta$, which asserts that $\beta$ should be larger than the mass of true data $P_d$ missed by the current model $P_g$. We argue that this is a rather reasonable condition: if $P_g$ misses many modes of $P_d$, we would prefer to assign a relatively large weight $\beta$ to the new component $Q$. In practice, however, both Conditions (10) and (11) are difficult to check, and a rigorous analysis of situations in which they are guaranteed is a direction for future research.

## 3 AdaGAN

We now describe the functions ChooseMixtureWeight and UpdateTrainingWeights of Algorithm 1. The complete AdaGAN meta-algorithm, with the details of UpdateTrainingWeights and ChooseMixtureWeight, is summarized in Algorithm 3 of Appendix A.

**UpdateTrainingWeights** At each iteration we add a new component $Q$ to the current mixture $P_g$ with weight $\beta$. The component $Q$ should approach the optimal target $Q^*_\beta$ given by (9) in Theorem 1. This distribution depends on the density ratio $dP_g/dP_d$, which is not directly accessible, but which can be estimated using adversarial training. Indeed, we can train a separate "mixture discriminator" $D_M$ to distinguish between samples from $P_d$ and samples from the current mixture $P_g$. It is known [13] that for an arbitrary f-divergence there exists a corresponding function $h$ such that the values of the optimal discriminator $D_M$ are related to the density ratio by
$$\frac{dP_g}{dP_d}(x) = h\big(D_M(x)\big). \tag{12}$$
We can thus replace $dP_g(x)/dP_d(x)$ in (9) with $h(D_M(x))$. For the Jensen–Shannon divergence, used by the original GAN algorithm, $h(z) = \frac{1-z}{z}$. In practice, when we compute $dQ^*_\beta$ on the training sample $S_N = (X_1, \ldots, X_N)$, each example $X_i$ receives the weight
$$w_i = \frac{1}{\beta N}\big(\lambda^* - (1-\beta)\, h(d_i)\big)_+, \qquad d_i = D_M(X_i). \tag{13}$$
The only remaining task is to determine $\lambda^*$. Since the weights $w_i$ in (13) must sum to 1, writing $p_i = 1/N$ for the empirical weights of the training points, we get
$$\lambda^* = \frac{\beta}{\sum_{i \in I(\lambda^*)} p_i}\left(1 + \frac{1-\beta}{\beta}\sum_{i \in I(\lambda^*)} p_i\, h(d_i)\right), \qquad I(\lambda) := \{i : \lambda > (1-\beta)\, h(d_i)\}. \tag{14}$$
To find $I(\lambda^*)$, we sort the values $h(d_i)$ in increasing order, $h(d_1) \le \cdots \le h(d_N)$; then $I(\lambda^*)$ consists of the first $k$ indices. We successively test $k = 1, 2, \ldots$ until the $\lambda$ given by (14) verifies $(1-\beta)\, h(d_k) < \lambda \le (1-\beta)\, h(d_{k+1})$. This procedure is guaranteed to converge by Theorem 1; it is summarized in Algorithm 2 of Appendix A, and a sketch in code follows below.
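The following is a minimal sketch of UpdateTrainingWeights based on these formulas, assuming the Jensen–Shannon link $h(z) = (1-z)/z$ and uniform empirical weights $p_i = 1/N$ (the function name and the stand-in discriminator outputs are illustrative):

```python
import numpy as np

def update_training_weights(d, beta):
    """Given discriminator outputs d_i = D_M(X_i), return the weights (13)."""
    N = len(d)
    h = (1.0 - d) / d                 # h(d_i), estimate of dP_g/dP_d at X_i
    h_sorted = np.sort(h)             # increasing order, as in Section 3
    lam = None
    for k in range(1, N + 1):         # try |I(lambda*)| = k, cf. eq. (14)
        s = h_sorted[:k].sum()
        lam_k = (beta * N + (1.0 - beta) * s) / k   # eq. (14) with p_i = 1/N
        upper = (1.0 - beta) * h_sorted[k] if k < N else np.inf
        if (1.0 - beta) * h_sorted[k - 1] < lam_k <= upper:
            lam = lam_k               # Theorem 1 guarantees such a k exists
            break
    return np.maximum(lam - (1.0 - beta) * h, 0.0) / (beta * N)   # eq. (13)

d = np.random.uniform(0.05, 0.95, size=1000)  # stand-in values of D_M(X_i)
w = update_training_weights(d, beta=0.5)
print(w.sum())                                 # ~1.0 by construction
```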
**ChooseMixtureWeight** For every $\beta$ there is an optimal re-weighting scheme, with weights given by (13). If the GAN could perfectly approximate its target $Q^*_\beta$, then choosing $\beta = 1$ would be optimal, because $Q^*_1 = P_d$. In practice, GANs cannot do that, so we propose to choose $\beta$ heuristically by imposing that each generator of the final mixture model has the same weight, which yields $\beta_t = 1/t$, where $t$ is the iteration index. Other heuristics are proposed in Appendix B, but they did not lead to any significant difference.

**The optimal discriminator** In practice it is of course hard to find the optimal discriminator $D_M$ achieving the global maximum of the variational representation of the f-divergence and verifying (12). For the JS divergence this would mean that $D_M$ is the classifier achieving minimal expected cross-entropy loss in the binary classification between $P_g$ and $P_d$. In practice we observed that the re-weighting (13) emphasizes at least some of the missing modes, as desired, as long as $D_M$ distinguishes reasonably well between data points already covered by the current model $P_g$ and those which are still missing. We found early stopping (while training $D_M$) sufficient to achieve this. In the worst case, when $D_M$ overfits and returns 1 for all true data points, the re-weighting simply yields the uniform distribution over the training set.

## 4 Experiments

We ran AdaGAN (code available online at https://github.com/tolstikhin/adagan) on toy datasets, for which we can interpret the missing modes in a clear and reproducible way, and on MNIST, which is a high-dimensional dataset. The goal of these experiments was not to evaluate the visual quality of individual sample points, but to demonstrate that the re-weighting scheme of AdaGAN promotes diversity and effectively covers the missing modes.

**Toy Datasets** Our target distribution is a mixture of isotropic Gaussians over $\mathbb{R}^2$. The distances between the means are large enough to roughly avoid overlaps between different Gaussian components. We vary the number of modes to test how well each algorithm performs when there are fewer or more expected modes. We compare the baseline GAN algorithm with AdaGAN variations, and with other meta-algorithms that all use the same underlying GAN procedure. For details on these algorithms and on the architectures of the underlying generator and discriminator, see Appendix B.

To evaluate how well the generated distribution matches the target distribution, we use a coverage metric $C$: the probability mass of the true data covered by the model $P_{model}$. More precisely, we compute $C := P_d(dP_{model} > t)$, with $t$ chosen such that $P_{model}(dP_{model} > t) = 0.95$. This metric is more interpretable than the likelihood, making it easier to assess the difference in performance between the algorithms. To approximate the density of $P_{model}$ we use kernel density estimation, with the bandwidth chosen by cross-validation; a sketch of the metric follows below.
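A minimal sketch of this coverage metric, with densities approximated by a scikit-learn KDE fit on model samples and the threshold taken empirically (the Gaussian samples are illustrative stand-ins for the real and generated data):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def coverage(real_points, model_points, mass=0.95):
    """C = P_d(dP_model > t), with t s.t. P_model(dP_model > t) = mass."""
    # Fit a KDE to model samples, bandwidth chosen by cross-validation.
    grid = GridSearchCV(KernelDensity(),
                        {"bandwidth": np.logspace(-2, 1, 20)}, cv=5)
    grid.fit(model_points)
    kde = grid.best_estimator_

    model_dens = np.exp(kde.score_samples(model_points))
    t = np.quantile(model_dens, 1.0 - mass)   # keeps `mass` of model's mass
    real_dens = np.exp(kde.score_samples(real_points))
    return float(np.mean(real_dens > t))      # empirical P_d(dP_model > t)

x_real = np.random.randn(2000, 2) + np.array([3.0, 0.0])  # true mode at (3,0)
x_model = np.random.randn(2000, 2)                        # model misses it
print(coverage(x_real, x_model))                          # low coverage
```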
We repeat each run 35 times with the same parameters (but different random seeds). For each run, the learning rate is optimized using a grid search on a validation set. We report the median over those runs, together with the intervals corresponding to the 5% and 95% percentiles. Figure 1 summarizes the performance of the algorithms as a function of the number of iterations $T$. Both the ensemble and the boosting approaches significantly outperform the vanilla GAN and the "best of $T$" algorithm. Interestingly, the improvements are significant even after just one or two additional iterations ($T = 2$ or $3$). Our boosting approach converges much faster; in addition, its variance is much lower, improving the likelihood that a given run gives good results. On this setup, the vanilla GAN approach has a significant number of catastrophic failures (visible in the lower bounds of the intervals).

Further empirical results are available in Appendix B, where we compare AdaGAN variations to several other baseline meta-algorithms in more detail (Table 1) and combine AdaGAN with unrolled GANs (UGAN) [4] (Figure 3). Interestingly, Figure 3 shows that AdaGAN run with UGAN outperforms the vanilla UGAN on the toy datasets, demonstrating the advantage of using AdaGAN as a way to further improve the mode coverage of any existing GAN implementation.

*Figure 1: Coverage $C$ of the true data by the model distribution $P^T_{model}$, as a function of the number of iterations $T$. Experiments correspond to the data distribution with 5 modes. Each blue point is the median over 35 runs; green intervals are defined by the 5% and 95% percentiles (see Section 4). Iteration 0 is equivalent to one vanilla GAN. The left plot corresponds to taking the best generator out of $T$ runs. The middle plot is the ensemble GAN, a uniform mixture of $T$ independently trained GAN generators. The right plot corresponds to our boosting approach (AdaGAN), with $\beta_t = 1/t$.*

**MNIST and MNIST3** We ran experiments both on the original MNIST dataset and on 3-digit MNIST (MNIST3) [5, 4], obtained by concatenating 3 randomly chosen MNIST images to form a 3-digit number between 0 and 999. According to [5, 4], MNIST contains 10 modes while MNIST3 contains 1000 modes, and these modes can be detected using a pre-trained MNIST classifier. We combined AdaGAN both with simple MLP GANs and with DCGANs [19]. We used $T \in \{5, 10\}$, tried models of various sizes, and performed a reasonable amount of hyperparameter search. Similarly to [4, Sec. 3.3.1], we failed to reproduce the missing-modes problem for MNIST3 reported in [5] and found that simple GAN architectures are capable of generating all 1000 numbers. The authors of [4] proposed to artificially reintroduce the missing modes by limiting the generator's flexibility. In our experiments, GANs trained with the architectures reported in [4] often generated poorly looking digits; as a result, the pre-trained MNIST classifier output random labels, which again led to full coverage of the 1000 numbers. We tried thresholding the confidence of the pre-trained classifier, but decided that this metric was too ad hoc.

*Figure 2: Digits from the MNIST dataset corresponding to the smallest (left) and largest (right) weights, obtained by the AdaGAN procedure (see Section 3) in one of the runs. Bold digits (left) are already covered, and the next GAN will concentrate on thin (right) digits.*

For MNIST we noticed that the re-weighted distribution often concentrated its mass on digits having very specific strokes: on different rounds it could highlight thick, thin, vertical, or diagonal digits, indicating that these traits were underrepresented in the generated samples (see Figure 2). This suggests that AdaGAN does a reasonable job at picking up different modes of the dataset, but also that there are more than 10 modes in MNIST (and more than 1000 in MNIST3); it is not clear how to evaluate the quality of generative models in this context. We also tried the "inversion" metric discussed in Section 3.4.1 of [4].
For MNIST3 we noticed that a single GAN was capable of reconstructing most of the training points very accurately, both visually and in the $\ell_2$-reconstruction sense. The inversion metric tests whether the trained model can generate certain examples, but unfortunately it does not take into account the probability of doing so.

## 5 Conclusion

We studied the problem of minimizing general f-divergences with additive mixtures of distributions. The main contribution of this work is a detailed theoretical analysis, which naturally leads to an iterative greedy procedure: on every iteration, the mixture is updated with a new component that minimizes the f-divergence to a re-weighted target distribution. We provided conditions under which this procedure is guaranteed to converge to the target distribution at an exponential rate. While our results can be combined with any generative modelling technique, we focused on GANs and provided a boosting-style algorithm, AdaGAN. Preliminary experiments show that AdaGAN successfully produces a mixture which iteratively covers the missing modes.

## References

[1] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.

[4] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv:1611.02163, 2017.

[5] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv:1612.02136, 2016.

[6] Max Welling, Richard S. Zemel, and Geoffrey E. Hinton. Self supervised boosting. In Advances in Neural Information Processing Systems, pages 665–672, 2002.

[7] Zhuowen Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.

[8] Aditya Grover and Stefano Ermon. Boosted generative models. ICLR 2017 conference submission, 2016.

[9] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

[10] Saharon Rosset and Eran Segal. Boosting density estimation. In Advances in Neural Information Processing Systems, pages 641–648, 2002.

[11] A. Barron and J. Li. Mixture density estimation. Biometrics, 53:603–618, 1997.

[12] Yaxing Wang, Lichao Zhang, and Joost van de Weijer. Ensembles of generative adversarial networks. arXiv:1612.00991, 2016.

[13] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, 2016.

[14] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.

[15] M. D. Reid and R. C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731–817, 2011.

[16] Bent Fuglede and Flemming Topsøe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, 2004.

[17] Matthias Hein and Olivier Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136–143, 2005.

[18] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[19] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.