# Instance Selection for GANs

Terrance DeVries (University of Guelph, Vector Institute) · Michal Drozdzal (Facebook AI Research) · Graham W. Taylor (University of Guelph, Vector Institute)

**Abstract.** Recent advances in Generative Adversarial Networks (GANs) have led to their widespread adoption for the purposes of generating high quality synthetic imagery. While capable of generating photo-realistic images, these models often produce unrealistic samples which fall outside of the data manifold. Several recently proposed techniques attempt to avoid spurious samples, either by rejecting them after generation, or by truncating the model's latent space. While effective, these methods are inefficient, as a large fraction of training time and model capacity is dedicated to samples that will ultimately go unused. In this work we propose a novel approach to improving sample quality: altering the training dataset via instance selection before model training has taken place. By refining the empirical data distribution before training, we redirect model capacity towards high-density regions, which ultimately improves sample fidelity, lowers model capacity requirements, and significantly reduces training time. Code is available at https://github.com/uoguelph-mlrg/instance_selection_for_gans.

## 1 Introduction

Recent advances in Generative Adversarial Networks (GANs) have enabled these models to be considered a tool of choice for vision synthesis tasks that demand high fidelity outputs, such as image and video generation [6, 12], image editing [41], inpainting [35], and super-resolution [32]. However, when sampling from a trained GAN model, outputs may be unrealistic just as often as they appear photo-realistic. GANs fit a model to a data distribution with the help of a discriminator network. Low quality samples produced by these models are often attributed to poor modeling of the low-density regions of the data manifold [11]. The majority of current techniques attempt to eliminate low quality samples after the model is trained, either by changing the model distribution through truncation of the latent space [2, 11], or by performing some form of rejection sampling using a trained discriminator to inform the rejection process [1, 5, 31]. Nevertheless, these methods are inefficient with respect to model capacity and training time, since much of the capacity and optimization effort dedicated to representing the sparse regions of the data manifold is wasted.

In this paper, we analyze the use of instance selection [21] in the generative setting. We address the problem of uneven model sample quality before GAN training has begun, rather than after it has finished. We note that dataset collection is a noisy process, and that many of the datasets currently used for generative model training and evaluation were not purposely created for this task. Thus, through a dataset curation step, we remove low-density regions from the data manifold prior to model optimization and show that this direct dataset intervention (1) improves overall image sample quality in exchange for some reduction in diversity, (2) lowers model capacity requirements, and (3) reduces training time. To remove the sparsest parts of the image manifold, images are first projected into an embedding space of perceptually meaningful representations. A scoring function is then fit to assess the manifold density in the neighbourhood of each embedded data point in the dataset.
Finally, data points with the lowest manifold density scores are removed from the dataset. In our experiments, we evaluate a variety of image embeddings and scoring functions, observing that Inceptionv3 and Gaussian likelihood are well suited for the respective roles. Overall, we make the following contributions:

- We propose dataset curation via instance selection to improve the output quality of GANs.
- We show that the manifold density in the perceptual embedding space of a given dataset is predictive of GAN performance, and therefore a good scoring function for instance selection.
- We demonstrate the model capacity savings of instance selection by achieving state-of-the-art performance (in terms of FID) on 64×64 resolution ImageNet generation using a Self-Attention GAN with half the trainable parameters of the current best model.
- We demonstrate training time savings by training a 128×128 resolution BigGAN on ImageNet in 1/4 of the time of the baseline, while also achieving superior performance across all image fidelity metrics.
- We exhibit the overall computational savings of instance selection by training a 256×256 resolution BigGAN on ImageNet with only 4 V100 GPUs in 11 days. Our model achieves better image fidelity than the baseline model while using half as many trainable parameters.

## 2 Related Work

Generative modelling of images is a very challenging problem due to the high-dimensional nature of images and the complexity of the distributions they form. Several different approaches to image generation have been proposed, with GANs currently the state of the art in terms of image generation quality. In this work we focus primarily on GANs, but other types of generative models might also benefit from instance selection prior to model fitting.

### 2.1 Sample Filtering in GANs

One way to improve sample quality from GANs without making any changes to the architecture or optimization algorithm is to apply techniques which automatically filter out poor quality samples from a trained model. Discriminator Rejection Sampling (DRS) [1] accomplishes this by performing rejection sampling on the generator. This process is informed by the discriminator, which is reused to estimate density ratios between the real and generated image manifolds. Metropolis-Hastings GAN (MH-GAN) [31] builds on DRS by (i) calibrating the discriminator to achieve more accurate density ratio estimates, and (ii) applying Markov chain Monte Carlo (MCMC) instead of rejection sampling for better performance on high-dimensional data. Ding et al. [8] further improve density ratio estimates by fine-tuning a pretrained ImageNet classifier for the task. For more efficient sampling, Discriminator Driven Latent Sampling (DDLS) [5] iteratively updates samples in the latent space to push them closer to realistic outputs.

Instead of filtering samples after the GAN has been trained, some methods do so during the training procedure. Latent Optimisation for Generative Adversarial Networks (LOGAN) [33] optimizes latent samples at each iteration at the cost of an additional forward and backward pass. Sinha et al. [27] demonstrate that gradients from low quality generated samples drive the model away from the nearest mode rather than towards it. As such, gradients from the worst samples at each iteration during training may be ignored to improve generation quality.
Perhaps the best-known approach for increasing sample fidelity in GANs is the truncation trick [2, 11, 16], used in the popular models BigGAN [2] and StyleGAN [11, 12] to improve image quality by manipulating the latent distribution. The original truncation trick as used by BigGAN consists of replacing the latent distribution with a truncated distribution during inference, such that any latent sample that falls outside of some acceptable range is resampled. StyleGAN uses a similar strategy, interpolating samples towards the mean of the latent space instead of resampling them. By moving samples closer to the interior regions of the latent space, sample diversity can effectively be traded for visual fidelity. Our instance selection technique has an effect similar to the truncation trick, but with the added benefit of also reducing model capacity and training time requirements.

### 2.2 Instance Selection

Instance selection is a data preprocessing technique commonly used in the classification setting to select a subset of data from a larger collection [21]. In general, instance selection methods either attempt to reduce the dataset to a more manageable size while retaining informative data points, or try to clean the dataset by eliminating noisy data points. Though commonly used in the setting of big data, instance selection has received little attention from the generative modelling community. Nuha et al. [20] explore the impact of reducing the size of the training set when training GANs; however, they select data points randomly, and observe no significant improvement in performance from the removal of data. Core-set selection has been shown to be useful for improving GAN performance when training with small mini-batches, but it ultimately does not improve image fidelity over large mini-batch training [26]. Whereas core-set selection attempts to select mini-batches that mimic the distribution of the original dataset, our proposed technique purposefully redefines the target distribution so as to maximize the density of the data manifold.

## 3 Instance Selection for GANs

In the context of generative modeling, our motivation is to automatically remove the sparsest regions of the data manifold, specifically those parts that GANs struggle to capture. To do so, we define an image embedding function F and a scoring function H.

**Embedding function** F projects images into an embedding space. More precisely, given a dataset of images X, the dataset of embedded images Z is obtained by applying the embedding function z = F(x) to each data point x ∈ X. For the task of image generation we suggest using perceptually aligned embedding functions [37], such as the feature space of a pretrained image classifier.

**Scoring function** H is used to assess the manifold density in a neighbourhood around each embedded data point z. In our experiments, we compare three choices of scoring function: log likelihood under a standard Gaussian model, log likelihood under a Probabilistic Principal Component Analysis (PPCA) [29] model, and distance to the Kth nearest neighbour (KNN Distance). We select Gaussian and PPCA as simple, well known density models. KNN Distance has previously been used as a measure of local manifold density in classical instance selection [3], and has been shown to be useful for defining non-linear image manifolds [14, 19].

The **Gaussian** model is fit to the embedded dataset by computing the empirical mean µ and the sample covariance Σ of Z.
The score of each embedded image z is computed as

$$H_{\mathrm{Gaussian}}(z) = -\frac{1}{2}\left[\ln(|\Sigma|) + (z-\mu)^\top \Sigma^{-1}(z-\mu) + d\,\ln(2\pi)\right], \tag{1}$$

where d is the dimension of z.

**PPCA** is fit to the embedded dataset using any standard PPCA solver [22]. We set the number of principal components such that 95% of the variance in the data is preserved. Embedded images are scored as

$$H_{\mathrm{PPCA}}(z) = -\frac{1}{2}\left[\ln(|C|) + \mathrm{Tr}\!\left((z-\mu)^\top C^{-1}(z-\mu)\right) + d\,\ln(2\pi)\right], \qquad C = WW^\top + \sigma^2 I, \tag{2}$$

where W is the fit model weight matrix, µ is the empirical mean of Z, σ is the residual variance, I is the identity matrix, and d is the dimension of z.

**KNN Distance** is used to score data points by calculating the Euclidean distance between z and Z \ {z}, then returning the distance to the Kth nearest element. To convert this to a score, we negate the resulting distance, such that smaller distances yield larger values. Formally, we evaluate

$$H_{\mathrm{KNN}}(z, K, Z) = -\min{}^{K}\left\{\, \lVert z - z_i \rVert_2 : z_i \in Z \setminus \{z\} \,\right\}, \tag{3}$$

where min^K denotes the Kth smallest value in a set. In our experiments we set K = 5.

To perform instance selection, we compute scores H(F(x)) for each data point and keep all data points with scores above some threshold ψ. For convenience, we often set ψ equal to some percentile of the scores, such that we preserve the top N% of the best scoring data points. Thus, given an initial training set consisting of data points x ∈ X, we construct our reduced training set X̂ by computing

$$\hat{X} = \{x \in X \;\text{s.t.}\; H(F(x)) > \psi\}. \tag{4}$$

To illustrate why removing data points from the training set might be a good idea, we look at the most and least likely images from the red fox class of ImageNet (Figure 1). Likelihood is determined by a Gaussian model fit on feature embeddings from a pretrained Inceptionv3 classifier. We notice a stark contrast between the content of the images. The most likely images (a) are similarly cropped around the fox's face, while the least likely images (b) contain many odd viewpoints and often suffer from occlusion. It is easy to imagine how a generative model trained on these unusual instances might generate samples that mimic such conditions, resulting in undesirable outputs.

Figure 1: Examples of the (a) most and (b) least likely resized images of red foxes from the ImageNet dataset, as determined by a Gaussian model fit on images in an Inceptionv3 embedding space. High likelihood images share a similar visual structure, while low likelihood samples are more varied.
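To make the scoring and selection procedure concrete, the following minimal NumPy sketch implements the Gaussian and KNN scorers (Eqs. 1 and 3) and the percentile thresholding step (Eq. 4). The function names are ours, not from the paper's released code, and the full-covariance Gaussian fit assumes N is comfortably larger than the embedding dimension d; otherwise a shrinkage estimate (or the PPCA variant) is needed.

```python
import numpy as np

def gaussian_scores(Z):
    """Eq. (1): log likelihood of each row of Z under a Gaussian fit to Z.
    Assumes N >> d; for high-dimensional embeddings, use shrinkage or PPCA
    to avoid a singular covariance."""
    mu = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False)
    d = Z.shape[1]
    _, logdet = np.linalg.slogdet(Sigma)
    diff = Z - mu
    # Mahalanobis term (z - mu)^T Sigma^{-1} (z - mu) for every row at once.
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (logdet + maha + d * np.log(2.0 * np.pi))

def knn_scores(Z, K=5):
    """Eq. (3): negative Euclidean distance to the Kth nearest neighbour.
    O(N^2) memory; fine as a sketch, use a KD-tree or similar at scale."""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude z itself (Z \ {z})
    return -np.sort(dists, axis=1)[:, K - 1]

def selection_mask(scores, retention_ratio=0.5):
    """Eq. (4): keep points whose score exceeds the percentile threshold psi."""
    psi = np.quantile(scores, 1.0 - retention_ratio)
    return scores > psi
```

For a 50% retention ratio, `selection_mask(gaussian_scores(Z))` returns the boolean indicator of X̂ over the embedded dataset Z.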
## 4 Experiments

In this section we review evaluation metrics, motivate selecting instances based on manifold density, and then analyze the impact of applying instance selection to GAN training.

### 4.1 Evaluation Metrics

We use a variety of evaluation metrics to diagnose the effect that training with instance selection has on the learned distribution, including: (1) Inception Score (IS) [24], (2) Fréchet Inception Distance (FID) [10], (3) Precision and Recall (P&R) [14], and (4) Density and Coverage (D&C) [19]. In all cases where a reference distribution is required we use the original training distribution. Using the distribution produced after instance selection would unfairly favour the evaluation of instance selection, since the reference distribution could be changed to one that is trivially easy to generate. A detailed description of each evaluation metric is provided in the supplementary material (§A). When calculating FID we follow Brock et al. [2] in using all images in the training set to estimate the reference distribution, and sampling 50k images to make up the generated distribution. For P&R and D&C we use an Inceptionv3 embedding (we use the PyTorch pretrained Inceptionv3 embedding for all metrics). N and M are set to 10k samples for both the reference and generated distributions, and K is set equal to 5 as recommended by Naeem et al. [19].

### 4.2 Relationship Between Dataset Manifold Density and GAN Performance

An image manifold is more accurately defined in regions where many data points lie in close proximity to each other [14]. Since GANs attempt to reproduce an image manifold based on data points from a given dataset, we suspect that they should perform better on datasets with well-defined manifolds (i.e. no sparse manifold regions). To verify this hypothesis, we use the ImageNet dataset [7] (used only for noncommercial, research purposes, and not for training networks deployed in production or for other commercial uses) and treat each of the 1000 classes as a separate dataset. Ideally, we would fit a separate GAN on each class to obtain a ground truth measure of performance, but this is very computationally expensive. Instead, we use a single class-conditional BigGAN from [2] that has been pretrained on ImageNet at 128×128 resolution. For each class, we sample 700 real images from the dataset, and generate 700 class-conditioned samples with the BigGAN. To measure the density of each class manifold we compare three different methods: Gaussian likelihood, Probabilistic Principal Component Analysis (PPCA) likelihood, and distance to the Kth neighbour (KNN Distance) (§3). Images are projected into the feature space of an Inceptionv3 model, and a manifold density score is computed on the features using one of our scoring functions. As an indicator of the true GAN output quality we compute FID between the real and generated distributions for each class. We observe a strong correlation between each of the manifold density measures and GAN output quality (Figure 2). This correlation confirms our hypothesis, suggesting that dataset manifold density is an important factor for achieving high quality generated samples with GANs.

Figure 2: Correlation between manifold density estimates and FID for each class in the ImageNet dataset (Gaussian negative log likelihood: correlation = 0.774; PPCA negative log likelihood: correlation = 0.804; KNN log distance, K = 5: correlation = 0.774). Lower values on the x-axis indicate a denser dataset manifold. Lower values on the y-axis indicate better quality generated samples.

### 4.3 Embedding and Scoring Function

Having established that dataset manifold density is correlated with GAN performance, we explore artificially increasing the overall density of the training set by removing data points that lie in low-density regions of the data manifold. To this end, we train several Self-Attention GANs (SAGAN) [36] on ImageNet at 64×64 resolution. Each model is trained on a different 50% subset of ImageNet, chosen by instance selection using the different embedding and scoring functions described in §3. Instance selection is applied per-class, as sketched below.
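A minimal sketch of this per-class application, reusing the hypothetical `gaussian_scores` helper from the sketch in §3 (the selection threshold is computed independently within each class):

```python
import numpy as np

def select_per_class(Z, labels, retention_ratio=0.5, score_fn=gaussian_scores):
    """Fit the scoring function and threshold separately within each class,
    so that a 50% retention ratio keeps the densest half of every class."""
    keep = np.zeros(len(Z), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        scores = score_fn(Z[idx])                         # higher = denser region
        psi = np.quantile(scores, 1.0 - retention_ratio)  # per-class threshold
        keep[idx] = scores > psi
    return keep
```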
We use the default settings for SAGAN, except that we use a batch size of 128 instead of 256, apply the self-attention module at 32×32 resolution instead of 64×64, and halve the number of channels in each layer in order to reduce the computational cost of our initial exploratory experiments. All models are trained for 200k iterations. The results of these experiments are shown in Table 1. For reference, we include scores achieved by real (i.e. not generated) data in Table 5 in the supplementary material.

Table 1: Comparison of embedding and scoring functions on the 64×64 ImageNet image generation task. All tests train a SAGAN model for 200k iterations. Models trained with instance selection significantly outperform models trained without instance selection, despite training on a fraction of the available data. RR is the retention ratio (percentage of dataset trained on). Best results in bold.

| Instance Selection | RR (%) | Embedding | Pretraining | IS | FID | P | R | D | C |
|---|---|---|---|---|---|---|---|---|---|
| None | 100 | - | - | 15.4 | 21.4 | 0.66 | **0.62** | 0.64 | 0.64 |
| Uniform | 50 | - | - | 15.5 | 22.8 | 0.65 | **0.62** | 0.65 | 0.65 |
| Gaussian | 50 | Inceptionv3 | ImageNet | **25.7** | **12.6** | **0.77** | 0.59 | **0.97** | **0.83** |
| PPCA | 50 | Inceptionv3 | ImageNet | 25.5 | 13.2 | 0.76 | 0.58 | **0.97** | 0.82 |
| KNN Dist | 50 | Inceptionv3 | ImageNet | 25.4 | 13.1 | 0.76 | 0.58 | **0.97** | 0.82 |
| Gaussian | 50 | Inceptionv3 | Random init | 15.5 | 21.9 | 0.66 | 0.61 | 0.68 | 0.65 |
| Gaussian | 50 | ResNet-50 | Places365 | 20.6 | 16.5 | 0.74 | 0.59 | 0.88 | 0.76 |
| Gaussian | 50 | ResNet-50 | SwAV | 20.3 | 16.7 | 0.74 | 0.57 | 0.89 | 0.76 |
| Gaussian | 50 | ResNet-50 | ImageNet | 22.0 | 14.6 | 0.76 | 0.59 | 0.92 | 0.79 |
| Gaussian | 50 | ResNeXt-101 | Instagram 1B | 24.1 | 14.1 | 0.73 | 0.61 | 0.86 | 0.80 |

All runs utilizing instance selection significantly outperform the baseline model trained on the full dataset, despite having access to only half as much training data (Table 1). We observe a large increase in image fidelity, as indicated by the improvements in Inception Score, Precision, and Density, and a slight drop in overall diversity, as measured by Recall. Coverage, which measures realism-constrained diversity, benefits greatly from the more realistic samples and thus sees an increase, despite the reduction in overall diversity. Since the increase in image quality is much greater than the decrease in diversity, FID also improves. To verify that the gains are not simply caused by the reduction in dataset size, we train a model on a 50% subset that was sampled uniformly at random from the full dataset. Here, we observe little change in performance compared to the baseline, indicating that performance improvements are indeed due to careful selection of training data rather than the reduction of dataset size.

We find that all three candidate scoring functions (Gaussian likelihood, PPCA likelihood, and KNN distance) significantly outperform the full dataset baseline. Gaussian likelihood slightly outperforms the alternatives, so we use it as the scoring function in the remainder of our experiments.

To understand the importance of the embedding function, we compare several different model embeddings that have been trained on different datasets: Inceptionv3 [28] trained on ImageNet; ResNet-50 [9] trained on Places365 [40], on ImageNet, and with SwAV unsupervised pretraining [4]; and ResNeXt-101 32×8d [34] trained with weak supervision on Instagram 1B [15]. We also compare against a randomly initialized Inceptionv3 with no pretraining as a random embedding. For all architectures, features are extracted after the global average pooling layer.
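A minimal PyTorch sketch of such an embedding function, assuming torchvision's ImageNet-pretrained Inceptionv3 with the classifier head replaced so the forward pass returns the pooled features; the resize and normalization constants below are standard torchvision practice and may differ from the paper's exact pipeline:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet-pretrained Inceptionv3 with the classifier head removed, so the
# forward pass returns the 2048-d features after global average pooling.
model = models.inception_v3(pretrained=True)
model.fc = nn.Identity()
model.eval()

# Standard Inceptionv3 preprocessing (299x299 inputs, ImageNet statistics).
preprocess = transforms.Compose([
    transforms.Resize(342),
    transforms.CenterCrop(299),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(batch):
    """batch: (N, 3, 299, 299) preprocessed images -> (N, 2048) embeddings."""
    return model(batch)
```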
We find that all feature embeddings improve performance over the full dataset baseline except for the randomly initialized network. These results suggest that an embedding function well aligned with the target domain is required for instance selection to be effective. The ImageNet pretrained Inceptionv3 embedding performs best overall, and was chosen as the embedding function for the rest of our experiments. We note that using an Inceptionv3 embedding both in instance selection and in the evaluation metrics may yield some non-negligible advantage in evaluation, since selected instances are those that the network prefers.

### 4.4 Retention Ratio

An important consideration when performing instance selection is what proportion of the original dataset to keep, a hyperparameter which we call the retention ratio. To investigate the impact of the retention ratio on training, we train ten SAGANs on ImageNet, each retaining a different amount of the original dataset in 10% intervals. GAN hyperparameters are the same as in §4.3, except that we extend training to 500k iterations in order to observe model behaviours over a longer training window. Results are shown in Figure 3 and Table 6 in the supplementary material.

Figure 3: SAGAN trained on 64×64 ImageNet, with instance selection used to reduce the dataset by varying amounts (each metric, including Inception Score, is plotted against training iterations up to 500k for retention ratios from 100 down to 10). Retention ratio = 100 indicates a model trained on the full dataset (i.e. no instance selection). The application of instance selection boosts overall performance significantly.

Figure 4: Samples of bird classes from SAGAN trained on 64×64 ImageNet: (a) baseline (trained on 100% of the dataset) and (b) instance selection (trained on 40% of the dataset). Each row is conditioned on a different class. Red borders indicate images misclassified with respect to their row's class by a pretrained Inceptionv3 classifier. Instance selection (b) significantly improves sample fidelity and class consistency compared to the baseline (a).

As larger portions of the original dataset are removed we see consistent improvements in image fidelity (increasing Inception Score, Precision, and Density) and reductions in sample diversity (decreasing Recall). Interestingly, metrics which take into account both realism and diversity (FID and Coverage) continue to see gains until roughly 70% of the dataset has been removed, at which point they begin to decrease. This behaviour suggests that, given the ability of current state-of-the-art models to learn from limited data, sample fidelity is valued much more than diversity. When too much of the dataset is removed some models collapse prematurely, likely due to the discriminator quickly overfitting the small training set. It is expected that applying data augmentation could resolve this issue [13, 38]. To further improve image fidelity, instance selection could be combined with the truncation trick (§E). Our best performing SAGAN model in terms of FID was trained on only 40% of the ImageNet dataset, yet outperforms FQ-BigGAN [39], the current state-of-the-art model for the task of 64×64 ImageNet generation. Despite using 2× fewer parameters and a 4× smaller batch size, our SAGAN achieves a better FID (9.07 vs. 9.76).
As indicated by these scores and by the errors made by a pretrained classifier, samples from our instance selection model are significantly more recognizable than those from the baseline model trained on the full dataset (Figure 4).

### 4.5 128×128 ImageNet

To examine the impact of instance selection on the training time of large-scale models, we train two BigGAN models on 128×128 ImageNet (we use the official BigGAN implementation from https://github.com/ajbrock/BigGAN-PyTorch). Our baseline model uses the default hyperparameters from BigGAN [2], with the exceptions that we reduce the channel multiplier from 96 to 64 (i.e. half of the capacity) and use only a single discriminator update instead of two for faster training. Our instance selection model uses the same settings as the baseline, but is trained on 50% of the dataset. Although large batch sizes are critical for achieving good performance with the baseline BigGAN [2], we found them to degrade performance when combined with instance selection. Therefore, we reduce the batch size from BigGAN's default of 2048 to 256 for the instance selection model. Both models are trained on 8 NVIDIA V100 GPUs with 16GB of RAM, using gradient accumulation to achieve the necessary batch sizes. Despite using a much smaller batch size, our model trained with instance selection outperforms the baseline in all metrics except for Recall (Table 2), as expected due to the diversity/fidelity trade-off. The instance selection model also trains significantly faster than the baseline, requiring less than four days while the baseline requires more than two weeks.

Table 2: Performance of models on the 128×128 ImageNet image generation task. Both models use a channel multiplier of 64 and a single discriminator update per generator update. The baseline model uses a batch size of 2048, while the instance selection model uses a batch size of 256.

| Model | IS | FID | P | R | D | C | Time | Hardware |
|---|---|---|---|---|---|---|---|---|
| BigGAN | 68.8 | 11.5 | 0.76 | 0.66 | 0.90 | 0.84 | 14.8 days | 8×V100 |
| BigGAN + Inst. Sel. | 114.3 | 9.6 | 0.88 | 0.50 | 1.34 | 0.90 | 3.7 days | 8×V100 |

### 4.6 256×256 ImageNet

To further demonstrate instance selection we train a BigGAN on ImageNet at 256×256 resolution using 4 V100s with 32GB of RAM each. Since training a baseline model without instance selection on the same hardware would take an excessively long time (1–2 months), we instead compare to the 256×256 BigGAN from Brock et al. [2] using the official pretrained weights (from https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb). Compared to this baseline, our model uses half the capacity (channel multiplier reduced from 96 to 64) and an 8× smaller batch size (from 2048 to 256), and applies the self-attention block in the generator at a resolution of 64×64 instead of 128×128. The retention ratio for instance selection is set to 50%. As in the baseline, we use two discriminator update steps per generator update for this experiment. Quantitative results are presented in Table 3, and samples are shown in Figure 5 and §G in the supplementary material.

Figure 5: Samples from BigGAN trained on 256×256 ImageNet, with the truncation trick: (a) baseline (trained on 100% of the dataset) and (b) instance selection (trained on 50% of the dataset). Samples are selected to demonstrate the highest quality outputs for each model. The baseline model (a) struggles to produce convincing facial details, which the instance selection model (b) successfully achieves. Zoom in for best viewing.
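Both large-scale experiments (§4.5 and §4.6) reach their effective batch sizes via gradient accumulation on a small number of GPUs. A minimal, self-contained PyTorch sketch of that general pattern follows; the toy model and data are stand-ins and this is not the paper's training code:

```python
import torch
from torch import nn, optim

# Toy stand-ins: a real setup would be a GAN generator/discriminator pair.
model = nn.Linear(128, 1)
optimizer = optim.Adam(model.parameters(), lr=2e-4)
loader = [(torch.randn(32, 128), torch.randn(32, 1)) for _ in range(16)]

accum_steps = 8  # effective batch size = 32 * 8 = 256

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y)
    # Scale each micro-batch loss so accumulated gradients average, not sum.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()      # one optimizer step per effective batch
        optimizer.zero_grad()
```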
Our instance selection model trains in less than 11 days, and uses approximately one order of magnitude fewer multiply-accumulate operations (MACs) than the baseline over the duration of training. Despite having half as much capacity, our model outperforms the baseline in all image fidelity focused metrics (Inception Score, Precision, and Density), and achieves comparable performance on metrics that jointly consider image quality and diversity (FID and Coverage). As expected, the better image quality comes at the cost of overall sample diversity (indicated by Recall). To our knowledge, this is the first time photorealistic generation of 256×256 ImageNet images has been achieved without the use of specialized hardware (i.e. hundreds of TPUs).

Table 3: Performance of models for 256×256 ImageNet image generation. The instance selection model uses half as many parameters as the baseline model. All metrics are computed using PyTorch Inceptionv3 embeddings, and may therefore differ from numbers computed with TensorFlow.

| Model | IS | FID | P | R | D | C | Time | Hardware |
|---|---|---|---|---|---|---|---|---|
| BigGAN | 135.4 | 9.8 | 0.86 | 0.70 | 1.18 | 0.92 | 1–2 days | 256×TPUv3 |
| BigGAN + Inst. Sel. | 165.3 | 10.6 | 0.91 | 0.52 | 1.48 | 0.93 | 10.7 days | 4×V100 |

## 5 Instance Selection in Practice

As our experiments have shown, instance selection is a useful tool for trading away sample diversity in exchange for improvements in image fidelity, faster training, and lower model capacity requirements. We believe that this trade-off is a worthwhile hyperparameter to tune in consideration of the available compute budget, just as it is common practice to adjust model capacity or batch size to fit within the memory constraints of the available hardware.

The control over the diversity/fidelity trade-off afforded by instance selection also yields a tool that can be used to better understand the behaviour and limitations of existing evaluation metrics. For instance, in some cases when applying instance selection, we observed that certain diversity-sensitive metrics (such as FID and Coverage) improved, even though the diversity of the training set had been significantly reduced. We leave it for future work to determine whether this is a limitation of these metrics, or a behaviour that should be expected.

Finally, instance selection can be used to automatically curate new datasets for the task of image generation. Existing datasets designed for image synthesis often rely on manual filtering and hand-crafted cropping and alignment tools to increase the dataset manifold density [11]. As an alternative to these time-intensive procedures, instance selection provides a generic solution that can quickly be applied to any uncurated set of images.

## 6 Conclusion

Folk wisdom suggests that more data is better; however, it is known that areas of the data manifold that are sparsely represented pose a challenge to current GANs [11]. To directly address this challenge we introduce a new tool: dataset curation via instance selection. Our motivation is to remove sparse regions of the data manifold before training, acknowledging that they will ultimately be poorly represented by the GAN, and therefore that attempting to capture them is an inefficient use of model capacity. Moreover, popular post-processing methods such as rejection sampling or latent space truncation will likely ignore these regions as represented by the model.
There are multiple benefits to taking the instance selection approach: (1) we improve sample fidelity across a variety of metrics compared to training on uncurated data; (2) we demonstrate that reallocating model capacity to denser regions of the data manifold leads to efficiency gains, meaning that we can achieve state-of-the-art quality with smaller-capacity models trained in far less time. To our knowledge, instance selection has not yet been formally analyzed in the generative setting. However, we argue that it is more important here than in supervised learning because of the absence of an annotation phase, where humans often perform some kind of formal or informal curation. We have only considered the setting where curation is performed up-front, prior to training. However, our results suggest that dynamic curation, including curriculum learning informed by the kinds of perceptually aligned embeddings we consider here, is an interesting direction for future work.

## Broader Impact

The application of instance selection to the task of image generative modelling brings with it several benefits. Gains in image generation quality are an obvious improvement, but perhaps more impactful to the broader community are the reductions in model capacity and training time that are afforded. Reducing the computational barrier to entry for training large-scale generative models provides many individuals, including students, AI artists, and ML enthusiasts, with access to models that are otherwise restricted to only the most well resourced labs. In addition to greater accessibility, lowering the computational requirements for training large-scale generative models also reduces the energy costs and CO2 emissions associated with the training process.

One side effect of our instance selection method is that, by nature of its design, generated results are more likely to reflect the content that makes up the majority of the training set. As such, dataset bias is amplified, as instances that are poorly represented in the dataset may be completely ignored. However, this limitation can be addressed by properly balancing the training set before instance selection is applied, or alternatively, by ensuring a more diverse and inclusive data collection effort to begin with.

As with any form of generative model, there is some potential for misuse. A common example is "deepfakes", where a generative model is used to manipulate images or videos well enough that humans cannot distinguish real from fake. While often used to create humorous videos in which actors' faces are swapped, deepfakes also have the potential for more nefarious uses, such as blackmail or spreading misinformation. Fortunately, much recent effort has been dedicated to the automatic detection of these false images [30]. These techniques attempt to find manipulated media by detecting inconsistencies, such as in the synchronization of lip movement and speech audio, or generation artifacts, such as missing reflections or other minute details.

## Acknowledgments and Disclosure of Funding

The authors would like to thank Colin Brennan, with whom discussions about dataset learnability kicked off this project, and Brendan Duke, for being a constant sounding board. Resources used in preparing this research were provided to GWT and TD, in part, by NSERC, the Canada Foundation for Innovation, the Province of Ontario, the Government of Canada through CIFAR, Compute Canada, and companies sponsoring the Vector Institute: http://www.vectorinstitute.ai/#partners.
## References

[1] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian J. Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2019.

[2] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2018.

[3] Joel Luis Carbonera and Mara Abel. A density-based approach for instance selection. In 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), pages 768–774. IEEE, 2015.

[4] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.

[5] Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, and Yoshua Bengio. Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. arXiv preprint arXiv:2003.06060, 2020.

[6] Aidan Clark, Jeff Donahue, and Karen Simonyan. Efficient video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[8] Xin Ding, Z. Jane Wang, and William J. Welch. Subsampling generative adversarial networks: Density ratio estimation in feature space with softplus loss. IEEE Transactions on Signal Processing, 68:1910–1922, 2020.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[11] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

[12] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958, 2019.

[13] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676, 2020.

[14] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, pages 3929–3938, 2019.

[15] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.

[16] Marco Marchesi. Megapixel size image creation using generative adversarial networks. arXiv preprint arXiv:1706.00082, 2017.

[17] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

[18] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406, 2018.
[19] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. arXiv preprint arXiv:2002.09797, 2020.

[20] Fajar Ulin Nuha et al. Training dataset reduction on generative adversarial network. Procedia Computer Science, 144:133–139, 2018.

[21] J. Arturo Olvera-López, J. Ariel Carrasco-Ochoa, J. Francisco Martínez-Trinidad, and Josef Kittler. A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143, 2010.

[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[23] Suman Ravuri and Oriol Vinyals. Classification accuracy score for conditional generative models. In Advances in Neural Information Processing Systems, pages 12268–12279, 2019.

[24] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[25] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. How good is my GAN? In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2018.

[26] Samarth Sinha, Han Zhang, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, and Augustus Odena. Small-GAN: Speeding up GAN training using core-sets. arXiv preprint arXiv:1910.13540, 2019.

[27] Samarth Sinha, Anirudh Goyal, Colin Raffel, and Augustus Odena. Top-k training of GANs: Improving generators by making critics less critical. arXiv preprint arXiv:2002.06224, 2020.

[28] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[29] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.

[30] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. Deepfakes and beyond: A survey of face manipulation and fake detection. arXiv preprint arXiv:2001.00179, 2020.

[31] Ryan C. Turner, Jane Hung, Yunus Saatci, and Jason Yosinski. Metropolis-Hastings generative adversarial networks. In ICML, 2018.

[32] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[33] Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, and Timothy P. Lillicrap. LOGAN: Latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953, 2019.

[34] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.

[35] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5505–5514, 2018.
[36] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

[37] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.

[38] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. arXiv preprint arXiv:2006.10738, 2020.

[39] Yang Zhao, Chunyuan Li, Ping Yu, Jianfeng Gao, and Changyou Chen. Feature quantization improves GAN training. arXiv preprint arXiv:2004.02088, 2020.

[40] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[41] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain GAN inversion for real image editing. arXiv preprint arXiv:2004.00049, 2020.