Published as a conference paper at ICLR 2020

ADVERSARIALLY ROBUST REPRESENTATIONS WITH SMOOTH ENCODERS

A. Taylan Cemgil, Sumedh Ghaisas, Krishnamurthy Dvijotham & Pushmeet Kohli
DeepMind, London
{taylancemgil,sumedhg,dvij,pushmeet}@google.com

ABSTRACT

This paper studies the undesired phenomenon of over-sensitivity of representations learned by deep networks to semantically-irrelevant changes in data. We identify a cause for this shortcoming in the classical Variational Auto-encoder (VAE) objective, the evidence lower bound (ELBO). We show that the ELBO fails to control the behaviour of the encoder outside the support of the empirical data distribution and that this behaviour of the VAE can lead to extreme errors in the learned representation. This is a key hurdle in the effective use of representations for data-efficient learning and transfer. To address this problem, we propose to augment the data with specifications that enforce insensitivity of the representation with respect to families of transformations. To incorporate these specifications, we propose a regularization method that is based on a selection mechanism that creates a fictive data point by explicitly perturbing an observed true data point. For certain choices of parameters, our formulation naturally leads to the minimization of the entropy regularized Wasserstein distance between representations. We illustrate our approach on standard datasets and experimentally show that significant improvements in the downstream adversarial accuracy can be achieved by learning robust representations completely in an unsupervised manner, without a reference to a particular downstream task and without a costly supervised adversarial training procedure.

1 INTRODUCTION

Representation learning is a fundamental problem in machine learning and holds the promise of enabling data-efficient learning and transfer to new tasks. Researchers working in domains like Computer Vision (Krizhevsky et al., 2012) and Natural Language Processing (Devlin et al., 2018) have already demonstrated the effectiveness of representations and features computed by deep architectures for the solution of other tasks. A case in point is the example of the FC7 features from the AlexNet image classification architecture that have been used for many other vision problems (Krizhevsky et al., 2012). The effectiveness of learned representations has given new impetus to research in representation learning, leading to a lot of work on the development of techniques for inducing representations from data with desirable properties like disentanglement and compactness (Burgess et al., 2018; Achille & Soatto, 2017; Bengio, 2013; Locatello et al., 2019).

Many popular techniques for generating representations are based on the Variational Auto-Encoder (VAE) model (Kingma & Welling, 2013; Rezende et al., 2014). The use of deep networks as universal function approximators has facilitated very rapid advancements, with samples generated from these models often being indistinguishable from natural data. While the quality of generated examples can provide convincing evidence that a generative model is flexible enough to capture the variability in the data distribution, it is far from a formal guarantee that the representation is fit for other purposes.
In fact, if the actual goal is learning good latent representations, evaluating generative models only on reconstruction fidelity and the subjective quality of typical samples is neither sufficient nor entirely necessary, and can even be misleading.

In this paper, we uncover a problematic failure mode in which representations learned by VAEs exhibit over-sensitivity to semantically-irrelevant changes in data. One example of such problematic behaviour can be seen in Figure 1. We identify a cause for this shortcoming in the classical Variational Auto-encoder (VAE) objective, the evidence lower bound (ELBO), which fails to control the behaviour of the encoder outside the support of the empirical data distribution. We show that this behaviour of the VAE can lead to extreme errors in the representation recovered by the encoder and is a key hurdle in the effective use of representations for data-efficient learning and transfer. To address this problem, we propose to augment the data with properties that enforce insensitivity of the representation with respect to families of transformations. To incorporate these specifications, we propose a regularization method that is based on a selection mechanism that creates a fictive data point by explicitly perturbing an observed true data point. For certain choices of parameters, our formulation naturally leads to the minimization of the entropy regularized Wasserstein distance between representations. We illustrate our approach on standard datasets and experimentally show that significant improvements in the downstream adversarial accuracy can be achieved by learning robust representations completely in an unsupervised manner, without a reference to a particular downstream task and without a costly supervised adversarial training procedure.

Figure 1: An illustration of the intrinsic fragility of VAE representations. Outputs from a Variational Autoencoder with encoder f and decoder g, parametrized by \eta and \theta respectively, trained on CelebA. Conditioned on the encoder input X_a = x_a, the decoder output \hat{X} = g(f(x_a)) = (g \circ f)(x_a) is shown on the top row. When the original example is perturbed with a carefully selected vector d such that X_b = X_a + d with \|d\| \le \epsilon, the output \hat{X} turns out to be perceptually very different. Such examples suggest that either the representations Z_a and Z_b are very different (the encoder is not smooth), or the decoder is very sensitive to small changes in the representation (the decoder is not smooth), or both. We identify the source of the problem primarily as the encoder and propose a practical solution.

It is clear that if learned representations are overly sensitive to irrelevant changes in the input (for example, small changes in the pixels of an image or video, or inaudible frequencies added to an audio signal), models that rely on these representations are naturally susceptible to making incorrect predictions when inputs are changed. We argue that such specifications about the robustness properties of learned representations can be one of the tractable guiding features in the search for good representations. Based on these observations, we make the following contributions:

1. We introduce a method for learning robust latent representations by explicitly targeting a structured model that admits the original VAE model as a marginal.
We also show that when the target is chosen as a pairwise conditional random field with attractive potentials, this choice leads naturally to the Wasserstein divergence between posterior distributions over the latent space. This insight provides us with a flexible class of robustness metrics for controlling representations learned by VAEs.

2. We develop a modification to training algorithms for VAEs to improve robustness of learned representations, using an external selection mechanism for obtaining transformed examples and by enforcing the corresponding representations to be close. As a particular selection mechanism, we adapt attacks from adversarial supervised learning (Madry et al., 2017) to attacks on the latent representation. Using this novel unsupervised training procedure we learn encoders with adjustable robustness properties and show that these are effective at learning representations that perform well across a variety of downstream tasks.

3. We show that alternative models proposed in the literature, in particular the β-VAE model used for explicitly controlling the learned representations, or Wasserstein Generative Adversarial Networks (GANs), can also be interpreted in our framework as variational lower bound maximization.

4. We show empirically, using simulation studies on MNIST, color MNIST and CelebA datasets, that models trained using our method learn representations that provide a higher degree of adversarial robustness even without supervised adversarial training.

2 GENERATIVE MODELS

Modern generative models are samplers p(X|θ) for generating realizations from an ideal target distribution π(X), also known as the data distribution. In practice π(X) is unknown in the sense that it is hard to formally specify. Instead, we have a representative data set X, samples that are assumed to be conditionally independently drawn from the data distribution π(X) of interest. We will refer to the empirical distribution as

\hat{\pi}(X) = \frac{1}{|\mathcal{X}|} \sum_{\xi \in \mathcal{X}} \delta(X - \xi)

The goal is learning a parameter θ* such that

p(X \mid \theta = \theta^*) = \int dZ\, p(X \mid Z, \theta = \theta^*)\, p(Z) \approx \hat{\pi}(X)

thereby also learning a generator.

2.1 FROM VAE TO SMOOTH ENCODERS

The VAE corresponds to the latent variable model p(X|Z, θ)p(Z) with latent variable Z and observation X. The forward model p(X|Z = z, θ) (the decoder) is represented using a neural network g with parameters θ, usually as the mean of a Gaussian N(X; g(z; θ), v I_x), where v is a scalar observation noise variance and I_x is an identity matrix. The prior is usually a standard Gaussian p(Z = z) = N(z; 0, I_z). The exact posterior over latent variables p(Z|X = x, θ) is approximated by a probability model q(Z|X = x, η) with parameters η. A popular choice here is a multivariate Gaussian N(Z; µ(x; η), Σ(x; η)), where the mapping f such that (µ, Σ) = f(x, η) is chosen to be a neural network (with parameters η to be learned from data). We will refer to the pair f, g as an encoder-decoder pair. Under the above assumptions, VAEs are trained by maximizing the following form of the ELBO using stochastic gradient descent (SGD),

\log p(X = x \mid \theta) \;\ge\; \mathbb{E}_{q(Z \mid X=x,\eta)}\{\log p(X = x \mid Z, \theta)\} - D_{KL}\big(q(Z \mid X=x,\eta) \,\|\, p(Z)\big) \;\triangleq\; B_x(\eta, \theta)

The gradient of the Kullback-Leibler (KL) divergence term above (see A.1) is available in closed form. An unbiased estimate of the gradient of the first term can be obtained by sampling z from q using the reparametrization trick (Kingma & Welling, 2013), aided by automatic differentiation.

2.2 A PROBLEM WITH THE VAE OBJECTIVE
Under the i.i.d. assumption, where each data point x^{(n)}, for n = 1 ... N, is independently drawn from the model, an equivalent batch ELBO objective can be defined (see also E.1) as

\frac{1}{N}\sum_{n=1}^{N} B_{x^{(n)}}(\eta, \theta) = -D_{KL}\left(\hat{\pi}(X)\, q(Z \mid X, \eta) \,\|\, p(X \mid Z, \theta)\, p(Z)\right) + \text{const}   (1)

where the empirical distribution of observed data is denoted as \hat{\pi}. This form makes it clearer that the variational lower bound is only measuring the distance between the encoder and decoder under the support of the empirical distribution. To see how this locality leads to a fragile representation, we construct a VAE with discrete latents and observations. We let X ∈ {1, ..., N_x} and Z ∈ {1, ..., N_z} and define the following system of conditional distributions as the decoder and encoder models:

p(X = i \mid Z = j) \propto \exp\left( \tfrac{1}{v}\, \omega(m_j - i/N_x) \right), \qquad q(Z = j \mid X = i) \propto \exp\left( \tfrac{1}{\sigma_i}\, \omega(\mu_i - j/N_z) \right)

where \omega(u) = \cos(2\pi u). These distributions can be visualized by heatmaps of probability tables where i and j are row and column indices, respectively (Figure 2). This particular von Mises-like parametrization is chosen to avoid boundary effects due to the finite latent and observable spaces.

Figure 2: Example VAE model. (left) Heatmap of the encoder distribution (darker colors referring to higher probability) q(Z = j|X = i; µ_i, σ_i), where each row i is a probability distribution over latents with a mode around µ_i and spread σ_i. (middle) Heatmap of the decoder distribution p(X = i|Z = j, m_j, v), where each column j is a probability distribution with mode at m_j and spread v. (right) The marginal model p(X = i|m, v) = \sum_{j=1}^{N_z} p(Z = j)\, p(X = i|Z = j, m_j, v), depicted as a histogram. The prior p(Z) is taken as uniform and is not shown.

Note that this parametrization emulates a high capacity network that can model any functional relationship between latent states and observations, while being qualitatively similar to a standard VAE model with conditionally Gaussian decoder and encoder functions. In reality, the true target density is not available but we would have a representative sample. To simulate this scenario, we sample a dataset from a discrete target distribution π(X): this is merely a draw from a multinomial distribution, yielding a multinomial vector s with entries s_i that count how many times we observe x = i. The results of such an experiment are depicted in Figure 3(a) (see caption for details). This picture reveals several important properties of a VAE approximation, discussed below.

Figure 3: (a) Result of optimizing the ELBO for a VAE, illustrating the fragility of the encoder. The subfigure titled Data (\hat{\pi}(X)) is a random sample from the true target Target (π(X)) on the right. The resulting encoder q(Z|X) and decoder p(X|Z) are shown as Q and P, respectively. The vertical and horizontal axes correspond to latents Z and observations X respectively. Integrating over the decoder distribution using a uniform prior p(Z) over the latents, we obtain the model distribution Model p(X) = \sum_Z p(X|Z)p(Z). (b) The results obtained by a smooth encoder. Both the decoder and the representation (encoder) are smoother while essentially having a similar fitting quality.
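The discrete toy model above is easy to reproduce. The following NumPy sketch builds the encoder and decoder probability tables and the marginal of the right panel of Figure 2; the grid sizes and parameter values are illustrative choices, not the ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Nx, Nz = 64, 32                         # sizes of the observable and latent grids (illustrative)
v, sigma = 0.05, 0.05 * np.ones(Nx)     # decoder / encoder spreads (illustrative)
m = rng.uniform(size=Nz)                # decoder modes m_j
mu = rng.uniform(size=Nx)               # encoder modes mu_i

omega = lambda u: np.cos(2 * np.pi * u)

# Decoder table: column j is p(X = i | Z = j), normalized over i.
i = np.arange(Nx)[:, None] / Nx
P = np.exp(omega(m[None, :] - i) / v)
P /= P.sum(axis=0, keepdims=True)

# Encoder table: row i is q(Z = j | X = i), normalized over j.
j = np.arange(Nz)[None, :] / Nz
Q = np.exp(omega(mu[:, None] - j) / sigma[:, None])
Q /= Q.sum(axis=1, keepdims=True)

# Marginal model p(X = i) under a uniform prior p(Z = j) = 1/Nz (right panel of Figure 2).
pX = P.mean(axis=1)
print(P.shape, Q.shape, pX.sum())       # (64, 32) (64, 32) 1.0
```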
1. After training, we observe that when j and j' are close, the corresponding conditionals p(X|Z = j) and p(X|Z = j') are close (hence the corresponding decoder mean parameters m_j and m_{j'} are close; see the middle panel of Figure 3(a), titled P, showing the decoder). This smoothness is perhaps surprising at first sight: in this example, we could arbitrarily permute the columns of the decoder and still get the same marginal distribution. Technically speaking, given a uniform prior p(Z), the marginal likelihood p(X|θ) is entirely invariant with respect to permutations of the latent state. In fact, if the encoder distribution were not constrained, we could also permute the columns of the encoder to keep the ELBO invariant. In appendix E.2, we provide an argument for why the choice of a unimodal encoder model and optimization of the variational objective leads naturally to smooth decoder functions.

2. The encoders found by the VAE, on the other hand, are not smooth at all, despite the fact that the model shows a relatively good fit. This behaviour alerts us against judging generative models only by the quality of samples obtained by traversing the latent space and generating conditional samples from the decoder. The quality of the decoder is not a proxy for the robustness of the representation.

The fragility of representations is inherent in the ELBO objective. For the entire dataset, a batch ELBO that involves the counts s_i can be written as

-\sum_{i}\sum_{j} s_i\, q(Z = j \mid X_a = i) \log \frac{s_i\, q(Z = j \mid X_a = i)}{p(X = i \mid Z = j)\, p(Z = j)}   (2)

The last expression is proportional to the negative KL divergence between two tabular distributions: s_i q(Z = j|X_a = i)/L and p(X = i|Z = j)p(Z = j), where L = \sum_i s_i is the total number of observations. As such, whenever s_i is zero, the contribution of row i of the encoder distribution vanishes and the corresponding parameters µ_i and σ_i do not affect the lower bound. In a sense, the objective does not enforce any structure on the encoder outside of the positions of the data points in the training set. Figure 3(a) shows that the out-of-sample behaviour of the encoder (i.e., for i where \hat{\pi}(X = i) = 0) is entirely initialization dependent; hence no learning takes place there. We would also expect the resulting representations to be fragile, in the sense that a small perturbation of an observation can result in a large change in the encoder output.

3 ROBUST REPRESENTATIONS WITH SMOOTH ENCODERS

In this section, we adopt a strategy for training the encoder that is guaranteed not to change the original objective of the decoder when maximizing the lower bound, while obtaining a smoother representation. The key idea of our approach is that we assume an external selection mechanism that is able to provide a new fictive data point x' in the vicinity of each observation x in our data set. Here, 'in the vicinity' means that we desire the latent state of the original data point z = f(x; η) and the latent state of the fictive point z' = f(x'; η) to be close to each other in some sense. Assuming the existence of such an external selection mechanism, we first define the following augmented distribution

p(X = x, X' = x' \mid \theta) \propto \int p(X = x \mid Z_a, \theta)\, p(X' = x' \mid Z_b, \theta)\, \psi(Z_a, Z_b)\, dZ_a\, dZ_b   (3)

where \psi(Z_a, Z_b) = \exp\left(-\tfrac{\gamma}{2}\, c(Z_a, Z_b)\right) p(Z_a)\, p(Z_b). This is a pairwise conditional Markov random field (CRF) model (Lafferty et al., 2001; Sutton & McCallum, 2012), where we take c(Z_a, Z_b) as a pairwise cost function. A natural choice here would be, for example, the squared Euclidean distance \|Z_a - Z_b\|^2. Moreover, we choose a nonnegative coupling parameter γ ≥ 0.
For any pairwise distribution Q(Z_a, Z_b), the ELBO has the following form

\log p(X = x, X' = x' \mid \theta) \ge \mathbb{E}_{Q(Z_a)}\{\log p(X = x \mid Z_a, \theta)\} + \mathbb{E}_{Q(Z_a)}\{\log p(Z_a)\}
\qquad + \mathbb{E}_{Q(Z_b)}\{\log p(X' = x' \mid Z_b, \theta)\} + \mathbb{E}_{Q(Z_b)}\{\log p(Z_b)\}
\qquad - \tfrac{\gamma}{2}\, \mathbb{E}_{Q(Z_a, Z_b)}\{c(Z_a, Z_b)\} + H(Q(Z_a, Z_b))   (4)

It may appear that the SE has to maintain a pairwise approximating distribution Q(Z_a, Z_b). However, this turns out not to be necessary. Given the encoder, the marginals of Q(Z_a, Z_b) are fixed as Q_a(Z_a) = q(Z|X_a = x, η) and Q_b(Z_b) = q(Z|X_b = x', η), so the only remaining terms that depend on the pair distribution are the final two terms in (4). We note that these two terms are just the objective function of the entropy regularized optimal transport problem (Cuturi, 2013; Amari et al., 2017). If we view Q(Z_a, Z_b) as a transport plan, the first term is maximal when the expected cost is minimal, while the second term is maximal when the variational distribution factorizes as Q(Z_a, Z_b) = Q_a(Z_a)Q_b(Z_b). In retrospect, this link is perhaps not that surprising, as the Wasserstein distance, the solution of the optimal transport problem, is itself defined as the solution to a variational problem (Solomon, 2018): consider a set Γ of joint densities Q(Z_a, Z_b) with the property that Q has fixed marginals Q_a(Z_a) and Q_b(Z_b), i.e.,

\Gamma[Q_a, Q_b] \equiv \left\{ Q : Q_a(Z_a) = \int Q(Z_a, Z_b)\, dZ_b, \;\; Q_b(Z_b) = \int Q(Z_a, Z_b)\, dZ_a \right\}   (5)

The Wasserstein divergence[1], denoted by WD, is defined as the solution of the optimization problem with respect to the pairwise distribution Q

WD[c](Q_a, Q_b) = \inf_{Q \in \Gamma} \int c(Z_a, Z_b)\, Q(Z_a, Z_b)\, dZ_a\, dZ_b   (6)

where c(Z_a, Z_b) is a function that specifies the cost of transferring a unit of probability mass from Z_a to Z_b.

[1] We use the term divergence to distinguish the optimal transport cost from the corresponding metric. This distinction is reminiscent of the distinction between the Euclidean divergence \|\cdot\|^2 and the Euclidean distance \|\cdot\|.

It is important to note that with our choice of the particular form of the variational distribution Q(Z_a, Z_b) we can ensure that we are still optimizing a lower bound of the original problem. We can achieve this by simply integrating out X', effectively ignoring the likelihood term for the fictive observations. Our choice does not modify the original objective of the decoder, due to the fact that the marginals are fixed given η. To see this, take the exponent of (4) and integrate over the unobserved X':

\log p(X = x \mid \theta) = \log \int dX'\, p(X = x, X' \mid \theta)
\qquad \ge \mathbb{E}_{Q(Z_a)}\{\log p(X = x \mid Z_a, \theta)\} + \mathbb{E}_{Q(Z_a)}\{\log p(Z_a)\} + \mathbb{E}_{Q(Z_b)}\{\log p(Z_b)\}
\qquad - \tfrac{\gamma}{2}\, \mathbb{E}_{Q(Z_a, Z_b)}\{c(Z_a, Z_b)\} + H(Q(Z_a, Z_b)) \;\triangleq\; B_{SE}(\theta, \eta)   (7)

We name this lower bound B_{SE} the Smooth Encoder ELBO (SE-ELBO). The gradient of B_{SE} with respect to the decoder parameters θ is identical to the gradient of the original VAE objective B. This is intuitive: as x' is an artificially generated sample, we should use only terms that depend on x and not on x'. Another advantage of this choice is that it is possible to optimize the decoder and encoder concurrently, as in the standard VAE. Only an additional term enters for the regularization of the encoder, where the marginals obtained via amortized inference, q(Z_a|x_a, η) and q(Z_b|x_b, η), are forced to be close in a regularized Wasserstein distance sense, with coupling strength γ. Effectively, we are doing data augmentation for smoothing the representations obtained by the encoder without changing the actual data distribution. In appendix E.3, we also provide an argument about the smoothness of the corresponding encoder mapping, justifying the name. The resulting algorithm is a simple modification of the standard VAE and is summarized below:

Initialize η^{(0)}, θ^{(0)}
for τ = 1, 2, ... do
    x_a = GetData(),  x_b = Select(x_a; L, ε)   (see Section 3.1)
    (µ_a, Σ_a) = f(x_a; η),  (µ_b, Σ_b) = f(x_b; η)   (compute representations)
    WD_{2,γ}(η) = EntropyRegularizedWassersteinDivergence(µ_a, Σ_a, µ_b, Σ_b, γ)   (see Apdx. B.2)
    u ~ N(0, I)   (reparametrization trick)
    E_1(η, θ) = \frac{1}{2v} \|x_a - g(µ_a + Σ_a^{1/2} u; θ)\|^2   (data fidelity)
    E_2(η) = \frac{1}{2}\left( \|µ_a\|^2 + \|µ_b\|^2 + \mathrm{Tr}\{Σ_a + Σ_b\} \right)   (prior fidelity)
    E(η, θ) = E_1(η, θ) + E_2(η) + WD_{2,γ}(η)
    (η^{(τ)}, θ^{(τ)}) = OptimizationStep(E, η^{(τ-1)}, θ^{(τ-1)})
end
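For concreteness, the following PyTorch-style sketch shows how one iteration of the loop above could be implemented for a fully factorized Gaussian encoder. The encoder/decoder modules and the select_fictive helper are placeholders, and the coupling term keeps only the γ/2 transport cost evaluated under a factorized plan, dropping the entropy terms of WD_{2,γ} (see Appendix B.2 for the full closed form); it is a simplified illustration under these assumptions, not the exact training code used in the paper.

```python
import torch

def smooth_encoder_step(encoder, decoder, optimizer, x_a, select_fictive,
                        gamma=1.0, v=0.5):
    """One SE training step. `encoder(x)` is assumed to return (mu, log_var) of a
    factorized Gaussian q(Z|x); `decoder(z)` returns the reconstruction mean; both
    are torch.nn.Module instances. `select_fictive` implements Select() of Sec. 3.1."""
    x_b = select_fictive(encoder, x_a)                      # fictive point near x_a

    mu_a, log_var_a = encoder(x_a)
    mu_b, log_var_b = encoder(x_b)
    var_a, var_b = log_var_a.exp(), log_var_b.exp()

    # Data fidelity E1: reconstruct the *true* point via the reparametrization trick.
    u = torch.randn_like(mu_a)
    x_hat = decoder(mu_a + var_a.sqrt() * u)
    E1 = ((x_a - x_hat) ** 2).flatten(1).sum(dim=1).mean() / (2.0 * v)

    # Prior fidelity E2 for both representations.
    E2 = 0.5 * (mu_a.pow(2) + mu_b.pow(2) + var_a + var_b).sum(dim=-1).mean()

    # Simplified coupling: gamma/2 * E||Z_a - Z_b||^2 under the factorized plan,
    # i.e. squared mean difference plus total variances (entropy terms omitted).
    coupling = 0.5 * gamma * ((mu_a - mu_b).pow(2) + var_a + var_b).sum(dim=-1).mean()

    loss = E1 + E2 + coupling
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```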
3.1 SELECTION MECHANISM VIA ADVERSARIAL ATTACKS

Adversarial attacks are one of the most popular approaches for probing trained models in supervised tasks, where the goal of an adversarial attack is finding small perturbations to an input example that would maximally change the output, e.g., flip a classification decision or significantly change a prediction (Szegedy et al., 2013). The perturbed input is called an adversarial example, and these extra examples are used, along with the original data points, for training adversarially robust models (Madry et al., 2017; Kurakin et al., 2016). As extra samples are also included, such a training procedure is referred to as data augmentation. However, in unsupervised learning and density estimation, data augmentation is not a valid approach, as the underlying empirical distribution would be altered by introducing new points.

However, as we let the encoder target a different distribution than the actual decoder, we can actually use the extra, self-generated samples to improve desirable properties of a given model. Hence this approach could also be interpreted as a self-supervised learning approach, where we bias our search for a good encoder and the data selection mechanism acts like a critic, carefully providing examples that should lead to similar representations.

In this paper we restrict ourselves to Projected Gradient Descent (PGD) attacks, popular in adversarial training (Carlini & Wagner, 2016), as a selection mechanism, where the goal of the attacker is finding a point that maximizes the Wasserstein distance between the latent representations. In other words, we implement our selection mechanism by finding the extra data point as an approximate solution of the following constrained optimization problem

x' = x + \arg\max_{d:\, \|d\|_p \le \epsilon} WD\big( q(Z \mid X = x, \eta),\; q(Z \mid X = x + d, \eta) \big)

This attack is assigned a certain iteration budget L for a given radius ε, which we refer to as the selection iteration budget and the selection radius, respectively. We note that a similar attack mechanism is proposed for generative models in (Kos et al., 2017), where one of the proposed attacks directly optimizes against differences in source and target latent representations. Note that our method is not restricted to a particular selection mechanism; indeed, any two inputs that should give a similar latent representation could be used as candidates.
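A minimal sketch of such a PGD-based selection step is given below, assuming an ℓ∞ ball, inputs scaled to [0, 1], and the unregularized squared 2-Wasserstein distance between the factorized Gaussian posteriors as the objective in place of WD_{2,γ}; function names and defaults are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def latent_divergence(encoder, x, x_pert):
    """Unregularized W2^2 between the factorized Gaussians q(Z|x) and q(Z|x'):
    squared distance between means plus squared distance between std deviations."""
    mu_a, log_var_a = encoder(x)
    mu_b, log_var_b = encoder(x_pert)
    std_a, std_b = (0.5 * log_var_a).exp(), (0.5 * log_var_b).exp()
    return ((mu_a - mu_b).pow(2) + (std_a - std_b).pow(2)).sum()

def select_pgd(encoder, x, eps=0.1, steps=20, step_size=None):
    """Select a fictive point x' = x + d with ||d||_inf <= eps that approximately
    maximizes the latent divergence (Section 3.1), via sign-gradient ascent."""
    step_size = step_size or 2.5 * eps / steps
    d = torch.zeros_like(x).uniform_(-eps, eps)        # random start inside the ball
    for _ in range(steps):
        d.requires_grad_(True)
        loss = latent_divergence(encoder, x, x + d)
        grad, = torch.autograd.grad(loss, d)
        with torch.no_grad():
            d = (d + step_size * grad.sign()).clamp(-eps, eps)   # ascent + projection
            d = (x + d).clamp(0.0, 1.0) - x                      # keep x' a valid input
    return (x + d).detach()
```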
4 EXPERIMENTS

Goal and Protocol: In our experiments, we have tested and compared the adversarial accuracy of representations learned using a VAE and our smooth encoder (SE) approach. We adopt a two-step experimental protocol, where we first train encoder-decoder pairs agnostic to any downstream task. Then we fix the representation, that is, we freeze the encoder parameters and only use the mean of the encoder as the representation, and train a simple linear classifier on the fixed representation using standard techniques. In this supervised stage, no adversarial training technique is employed. Ideally, we hope that such an approach will provide a degree of adversarial robustness without the need for a costly, task-specific adversarial training procedure. To evaluate the robustness of the resulting classifier, for each data point in the test set we search for an adversarial example using an untargeted attack that tries to change the classification decision. The adversarial accuracy is reported as the percentage of examples for which the attack is not able to find an adversarial example.

The VAE and SE decoders and encoders are implemented using standard MLP and ConvNet architectures. The selection procedure for SE training is implemented as a projected gradient descent optimization (a PGD attack) with a selection iteration budget of L iterations to maximize the Wasserstein distance between q(Z|X = x) and q(Z|X = x + δ) with respect to the perturbation δ, where ‖δ‖ ≤ ε. Further details about the experiments can be found in appendix C.1.

Results: We run simulations on Color MNIST, MNIST and CelebA datasets. Color MNIST is constructed from the MNIST dataset by coloring each digit artificially with the colors corresponding to seven of the eight corners of the RGB cube (excluding black). We present the results with the strongest attack we have experimented with: a PGD attack with 100 iterations and 10 restarts. We observe that for weaker attacks (such as 50 iterations with no restarts), the adversarial accuracy is typically much higher.

For the Color MNIST dataset, the results are shown in Figure 4, where we test the adversarial accuracy of representations learned by our method and compare it to a VAE. We observe that the adversarial accuracy of a VAE representation quickly drops towards zero, while SE can maintain adversarial accuracy in both tasks. In particular, we observe that for the arguably simpler color classification task, we are able to obtain close to perfect nominal test accuracy using representations learned by the VAE and SE. However, when the classifiers are attacked using PGD, the adversarial accuracy quickly drops with increasing radius size, while the accuracy degrades more gracefully in the SE case. In Figure 5, we show the robustness behaviour of the method for different architectures. A ConvNet seems to perform relatively better than an MLP, but these results show that the VAE representation is not robust, irrespective of the architecture. We have also carried out controlled experiments with random selection instead of the more costly untargeted adversarial attacks (see appendix C.1, Figure 7(a), for further results).

(a) Task: Color  (b) Task: Digit

Figure 4: Simulation results on Color MNIST. The goal is the comparison of adversarial accuracy of VAE representations with SE representations trained with a selection radius of 0.1 and 0.2 and a selection budget of 20 PGD iterations. The vertical axis shows the adversarial accuracy as a function of attack radius. The dashed and dotted lines show the nominal accuracy of the VAE and SE when there are no attacks (the SE nominal accuracy is virtually identical for different selection radii, hence only a single level is shown).
We observe some limited improvements in adversarial accuracy with SE using random selection compared to the VAE, but training an SE with adversarial selection is much more effective. We note that the selection iteration budget was lower (L = 20 with no restarts) than the attack iteration budget (100 with 10 restarts) during evaluation. It was not practical to train the encoder with more powerful selection attacks, so it remains to be seen whether the tendency of increased accuracy with increased iteration budgets would continue. We also observe that essentially the same level of adversarial accuracy can be obtained with a small fraction of the available labels (see appendix C.1, Figure 8, for further results).

(a) MLP  (b) ConvNet

Figure 5: Simulation results on MNIST. The goal is illustrating the effect of the architecture (MLP and ConvNet). In all the examples, the SE is trained with a selection budget of 50 PGD iterations. The linear classifier is always trained without any adversarial training, by fixing the encoder parameters. The blue dots, green triangles and red squares correspond to the standard VAE and the SE trained with a selection radius of 0.1 and 0.2, respectively. The dashed and dotted lines show the nominal accuracy of the VAE and SE when there are no attacks (the SE nominal accuracy is virtually identical for different selection radii, hence only a single level is shown).

We have also repeated our experiments on the CelebA dataset, a large collection of high resolution face images labeled with 40 attribute labels per example. We have used 17 of the attribute labels as the targets of 17 different downstream classification tasks. The results are shown in Table 1. The results clearly illustrate that we can achieve much more robust representations than a VAE. It is also informative to investigate specific adversarial examples to understand the failure modes. In Figure 6 we show two illustrative examples from CelebA. Here we observe that attacks on the SE representations are much more structured and semantically interpretable. In our exploratory investigations, we qualitatively observe that the reconstructions corresponding to the adversarial examples are almost always legitimate face images with clearly recognizable features. This also seems to support our earlier observation that VAE decoders are typically smooth while the encoders are inferring non-robust features. Our approach seems to be a step towards obtaining more robust representations.

Figure 6: Qualitative results on CelebA. Columns show X_a, (g \circ f)(X_a), X_b = X_a + d, (g \circ f)(X_b), and d. Attacks on downstream tasks: (a), (b) Mustache classification; (c), (d) Bald classification. (a) In the VAE case, the attack is successful but the perturbation does not have a visible structure. (b) The SE representation is attacked by a perturbation that can clearly be identified as drawing a beard on the image.
In this case, the attack is able to fool the classifier, and the generated image from the representation is that of a person with a beard and mustache. In the second example (c), the VAE representation seems to be attacked by exploiting the non-smooth nature of the encoder, mapping the latent representation to one in the vicinity of a clearly different person with the desired features, as can be seen from the corresponding reconstruction. In contrast, in the SE case (d), an unsuccessful attack adds a much more structured perturbation. From the reconstruction it is evident that a latent feature is attacked that seems to control the receding hairline.

Task            VAE Nom   VAE Adv   SE Nom   SE Adv
Bald              97.8       0.0      97.4     86.5
Mustache          96.1       0.0      95.7     84.4
Necklace          86.2       0.0      88.0     78.9
Straight Hair     79.1       0.0      78.7     77.3
Wearing Hat       96.6       0.0      96.4     77.3
Earrings          79.3       0.0      81.3     55.3
Blond Hair        92.4       0.0      90.7     53.5
Necktie           93.2       0.0      92.7     51.7
Brown Hair        83.1       0.0      80.5     41.5
Eyeglasses        95.6       0.0      95.7     33.0
Black Hair        79.7       0.0      81.4     31.4
Bangs             90.3       0.0      89.6     27.0
No Beard          86.5       0.0      85.3     24.3
Wavy Hair         71.4       0.0      72.8     10.2
Smiling           82.7       0.0      85.7      1.1
Gender            81.1       0.0      81.6      0.7
Lipstick          79.2       0.0      80.3      0.6

Table 1: Comparison of nominal (Nom) and adversarial (Adv) accuracy (in percent) on 17 downstream tasks using a VAE and an SE trained with a selection radius of ε = 0.1 and evaluated with an attack radius of 0.1 and an iteration budget of 100 with 10 restarts. See Section 4 for the experimental protocol.

5 RELATED WORK

The literature on deep generative models and representation learning is quite extensive and rapidly expanding. There is a plethora of models, but some approaches have been particularly popular in recent years: Generative Adversarial Networks (GANs) and VAEs. While the connection of our approach to VAEs is evident, there is also a connection to GANs. In the appendix, we provide the details, where we show that a GAN decoder can be viewed as an instance of a particular smooth encoder.

Our method is closely related to the β-VAE (Higgins et al., 2017), which is used for controlling representations and replaces the original variational objective (1) with another one that explicitly trades data fidelity against prior fidelity. In the appendix, we show that this method can also be viewed as an instance of smooth encoders.

Wasserstein distance minimization has been applied in generative models as an alternative objective for fitting the decoder. Following the general framework sketched in Bousquet et al. (2017), the terms of the variational decomposition of the marginal likelihood can be modified in order to change the observation model or the regulariser. For example, Wasserstein Auto-Encoders (WAE) (Tolstikhin et al., 2017; Zhang et al., 2019) or sliced Wasserstein Autoencoders (Kolouri et al., 2018) propose to replace the data fidelity and/or the KL terms with a Wasserstein distance. Our approach is different from these approaches, as we do not propose to replace the likelihood as a fundamental principle for data fitting. In contrast, the Wasserstein distance formulation naturally emerges from the particular model choice and the corresponding variational approximation.

Our approach involves an adversarial selection step. The word 'adversarial' is an overloaded term in generative modelling, so it is important to clarify the differences between our approach and related techniques.
Adversarial Variational Bayes is a well known technique in the literature that aims to combine the empirical success of GANs with the probabilistic formulation of VAEs, where the limiting functional form of the variational distribution can be replaced by black-box inference (Mescheder et al., 2017). This approach also does not modify the original VAE objective; however, the motivation is different, as the aim is to develop a richer approximating family. In our view, for learning useful representations when the decoder is unknown, the advantage of having a more powerful approximating family is not clear yet; it can even make the task of learning a good representation harder.

Adversarial Autoencoders (Makhzani et al., 2015), Adversarially Learned Inference (ALI) (Dumoulin et al., 2016) and BiGANs (Bidirectional GANs) (Donahue et al., 2016) are also techniques that combine ideas from GANs and VAEs for learning generative models. The key idea is matching an encoder process q(z|x)p(x) to the decoder process p(z)p(x|z) using an alternative objective, rather than by minimizing the KL divergence. In this formulation, p(x) is approximated by the empirical data distribution, and p(z) is the prior model of a VAE. The encoder q(z|x) and decoder p(x|z) are modelled using deep networks. This approach is similar to Wasserstein autoencoders, which propose to replace the likelihood principle.

The idea of improving VAEs by capturing the correlation structure between data points using MRFs and graphical models has also been proposed recently (Tang et al., 2019) under the name Correlated Variational Auto-Encoders (CVAEs). Our approach is similar; however, we introduce the correlation structure not between individual data points but only between true data points and artificially selected data points. We believe that correctly selecting such a correlation structure over the individual data points can be quite hard in practice, but if such prior knowledge is available, a CVAE can indeed be a much more powerful model than a VAE. We note that a proposal for automatically learning such a correlation structure has also been made recently (Louizos et al., 2019).

6 DISCUSSION AND CONCLUSIONS

In this paper, we have introduced a method for improving the robustness of latent representations learned by a VAE. It must be stressed that our goal is not building the most powerful adversarially robust supervised classifier, but obtaining a method for learning generic representations that can be used for several tasks; the tasks can even be unknown at the time of learning the representations. While the nominal accuracy of an unsupervised approach is expected to be inferior to a supervised training method that is informed by extra label information, we observe that significant improvements in adversarial robustness can be achieved by our approach that forces smooth representations.

ACKNOWLEDGMENTS

We are grateful to Arnaud Doucet for the insights and references about optimal transport, and to Arnaud Doucet, Johannes Welbl, Sven Gowal and Michalis Titsias for their helpful comments on earlier drafts of this paper.

REFERENCES

Alessandro Achille and Stefano Soatto. Emergence of Invariance and Disentanglement in Deep Representations. arXiv e-prints, art. arXiv:1706.01350, Jun 2017.

Shun-ichi Amari, Ryo Karakida, and Masafumi Oizumi. Information Geometry Connecting Wasserstein Distance and Kullback-Leibler Divergence via the Entropy-Relaxed Transportation Problem. arXiv e-prints, art.
arXiv:1709.10219, Sep 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv e-prints, art. arXiv:1701.07875, Jan 2017.

Yogesh Balaji, Hamed Hassani, Rama Chellappa, and Soheil Feizi. Entropic GANs meet VAEs: A Statistical Approach to Compute Sample Likelihoods in GANs. arXiv e-prints, art. arXiv:1810.04147, Oct 2018.

Yoshua Bengio. Deep Learning of Representations: Looking Forward. arXiv e-prints, art. arXiv:1305.0445, May 2013.

Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. arXiv e-prints, art. arXiv:1705.07642, May 2017.

Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv e-prints, art. arXiv:1804.03599, Apr 2018.

Nicholas Carlini and David Wagner. Towards Evaluating the Robustness of Neural Networks. arXiv e-prints, art. arXiv:1608.04644, Aug 2016.

Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. arXiv e-prints, art. arXiv:1306.0895, Jun 2013.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, art. arXiv:1810.04805, Oct 2018.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial Feature Learning. arXiv e-prints, art. arXiv:1605.09782, May 2016.

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially Learned Inference. arXiv e-prints, art. arXiv:1606.00704, Jun 2016.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv e-prints, art. arXiv:1406.2661, Jun 2014.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.

Matthew D. Hoffman and Matthew J. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Advances in Approximate Bayesian Inference, NIPS 2016 Workshop, 2016.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv e-prints, art. arXiv:1312.6114, Dec 2013.

Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model. arXiv e-prints, art. arXiv:1804.01947, Apr 2018.

Jernej Kos, Ian Fischer, and Dawn Song. Adversarial examples for generative models. arXiv e-prints, art. arXiv:1702.06832, Feb 2017.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 1097-1105. Curran Associates, Inc., 2012.

Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pp. 282-289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-778-1. URL http://dl.acm.org/citation.cfm?id=645530.655813.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4114-4124, Long Beach, California, USA, 09-15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/locatello19a.html.

Christos Louizos, Xiahan Shi, Klamer Schutte, and Max Welling. The Functional Neural Process. arXiv e-prints, art. arXiv:1906.08324, Jun 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial Autoencoders. arXiv e-prints, art. arXiv:1511.05644, Nov 2015.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. arXiv e-prints, art. arXiv:1701.04722, Jan 2017.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. arXiv e-prints, art. arXiv:1401.4082, Jan 2014.

Justin Solomon. Optimal Transport on Discrete Domains. arXiv e-prints, art. arXiv:1801.07745, Jan 2018.

Charles Sutton and Andrew McCallum. An introduction to conditional random fields. Found. Trends Mach. Learn., 4(4):267-373, April 2012. ISSN 1935-8237. doi: 10.1561/2200000013. URL http://dx.doi.org/10.1561/2200000013.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Da Tang, Dawen Liang, Tony Jebara, and Nicholas Ruozzi. Correlated Variational Auto-Encoders. arXiv e-prints, art. arXiv:1905.05335, May 2019.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein Auto-Encoders. arXiv e-prints, art. arXiv:1711.01558, Nov 2017.

Shunkang Zhang, Yuan Gao, Yuling Jiao, Jin Liu, Yang Wang, and Can Yang. Wasserstein-Wasserstein Auto-Encoders. arXiv e-prints, art. arXiv:1902.09323, Feb 2019.

A.1 KL DIVERGENCE

The KL divergence between two Gaussian distributions translates to a well known divergence in the parameters (in the general case this is a Bregman divergence):

KL(P_a \| P_b) = \frac{1}{2}\left[ \mathrm{Tr}\left( \Sigma_b^{-1}(\Sigma_a - \Sigma_b) \right) - \log\left| \Sigma_b^{-1} \Sigma_a \right| \right] + \frac{1}{2} (\mu_a - \mu_b)^\top \Sigma_b^{-1} (\mu_a - \mu_b)   (8)

where P_a = N(\mu_a, \Sigma_a) and P_b = N(\mu_b, \Sigma_b) are Gaussian densities with means \mu_a, \mu_b and covariance matrices \Sigma_a, \Sigma_b, |\cdot| denotes the determinant of a matrix argument, and Tr denotes the trace. The KL divergence consists of two terms: the first is the scale invariant divergence between two covariance matrices, also known as an Itakura-Saito divergence, and the second is a Mahalanobis distance between the means. The KL divergence is invariant to the choice of parametrization or the choice of the coordinate system.
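Equation (8) can be transcribed directly. The following NumPy helper is a straightforward, unoptimized sketch intended only for checking the formula numerically.

```python
import numpy as np

def gaussian_kl(mu_a, cov_a, mu_b, cov_b):
    """KL(N(mu_a, cov_a) || N(mu_b, cov_b)) following Eq. (8)."""
    cov_b_inv = np.linalg.inv(cov_b)
    diff = mu_a - mu_b
    term_cov = np.trace(cov_b_inv @ (cov_a - cov_b)) - np.log(np.linalg.det(cov_b_inv @ cov_a))
    term_mean = diff @ cov_b_inv @ diff
    return 0.5 * term_cov + 0.5 * term_mean

# Sanity check: the KL divergence between identical Gaussians is zero.
mu = np.zeros(3)
cov = np.diag([1.0, 2.0, 0.5])
assert abs(gaussian_kl(mu, cov, mu, cov)) < 1e-12
```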
B OPTIMAL TRANSPORT AND WASSERSTEIN DISTANCE

Consider a set Γ of joint densities Q(Z_a, Z_b) with the property that Q has fixed marginals Q_a(Z_a) and Q_b(Z_b), i.e.,

\Gamma[Q_a, Q_b] \equiv \left\{ Q : Q_a(Z_a) = \int Q(Z_a, Z_b)\, dZ_b, \;\; Q_b(Z_b) = \int Q(Z_a, Z_b)\, dZ_a \right\}   (9)

The Wasserstein divergence WD is defined as the solution of the optimization problem with respect to the pairwise distribution Q

WD[c](Q_a, Q_b) = \inf_{Q \in \Gamma} \int c(Z_a, Z_b)\, Q(Z_a, Z_b)\, dZ_a\, dZ_b   (10)

where c(z_a, z_b) is a function that specifies the cost of transferring a unit of probability mass from z_a to z_b.

B.1 ℓ2-WASSERSTEIN DISTANCE

The ℓ2-Wasserstein distance W_2^2 for two Gaussians has an interesting form. The optimal transport plan, where the minimum of (10) is attained, is given by

Q^*(z_a, z_b) = N\!\left( \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} \Sigma_a & \Psi \\ \Psi^\top & \Sigma_b \end{pmatrix} \right)   (11)

where \Psi = \Sigma_a \Sigma_b^{1/2} \left( \Sigma_b^{1/2} \Sigma_a \Sigma_b^{1/2} \right)^{-1/2} \Sigma_b^{1/2}. It can be checked that this optimal Gaussian density is degenerate in the sense that there exists a linear mapping between z_a and z_b:

z_a(z_b) = \mu_a + \Sigma_a \Sigma_b^{1/2} \left( \Sigma_b^{1/2} \Sigma_a \Sigma_b^{1/2} \right)^{-1/2} \Sigma_b^{-1/2} (z_b - \mu_b)

where A^{1/2} denotes the matrix square root, a symmetric matrix such that (A^{1/2})^2 = A for a symmetric positive semidefinite matrix A. The ℓ2-Wasserstein distance is the value attained by the optimal transport plan

W_2^2(P_a, P_b) = \|\mu_a - \mu_b\|_2^2 + \mathrm{Tr}\left\{ \Sigma_a + \Sigma_b - 2\left( \Sigma_b^{1/2} \Sigma_a \Sigma_b^{1/2} \right)^{1/2} \right\}   (12)

B.2 ENTROPY REGULARIZED ℓ2-WASSERSTEIN DISTANCE

The entropy regularized ℓ2-Wasserstein divergence is the value attained by the minimizer of the following functional

\frac{\gamma}{2}\, \mathbb{E}_{Q(Z_a, Z_b)}\left\{ \mathrm{Tr}\,(Z_a - Z_b)(Z_a - Z_b)^\top \right\} - H[Q(Z_a, Z_b)]   (13)

where H is the entropy of the joint distribution Q. Using the form in (11), subject to the semidefinite constraint \Sigma_a - \Psi \Sigma_b^{-1} \Psi^\top \succeq 0, we have

\mathbb{E}\left\{ \mathrm{Tr}\,(z_a - z_b)(z_a - z_b)^\top \right\} = -2\, \mathrm{Tr}(\Psi) + \text{const}   (14)

The entropy of the Gaussian Q(z_a, z_b) is given by the Schur formula

H[Q(z_a, z_b)] = \frac{D}{2} \log(2\pi e) + \frac{1}{2} \log\left( |\Sigma_b|\, |\Sigma_a - \Psi \Sigma_b^{-1} \Psi^\top| \right)   (15)

Here, D is the dimension of the vector (z_a, z_b). The entropy regularized problem therefore reduces to minimizing

F(\Psi) = -\gamma\, \mathrm{Tr}(\Psi) - \frac{1}{2}\, \mathrm{Tr} \log\left( \Sigma_a - \Psi \Sigma_b^{-1} \Psi^\top \right)   (16)

Taking the derivative and setting it to zero,

-\gamma I + \Sigma_b^{-1} \Psi^\top \left( \Sigma_a - \Psi \Sigma_b^{-1} \Psi^\top \right)^{-1} = 0   (17)

we obtain a particular matrix Riccati equation

0 = \Psi \Sigma_b^{-1} \Psi^\top + \frac{1}{\gamma}\, \Sigma_b^{-1} \Psi - \Sigma_a   (18)

that gives us a closed form formula for the specific entropy regularized Wasserstein distance

W_{2,\gamma}^2\big(N(m_a, \Sigma_a), N(m_b, \Sigma_b)\big) = \|m_a - m_b\|_2^2 + \mathrm{Tr}\{ \Sigma_a + \Sigma_b - 2\Psi \}   (19)

WD_{2,\gamma}\big(N(m_a, \Sigma_a), N(m_b, \Sigma_b)\big) = \frac{\gamma}{2}\, W_{2,\gamma}^2\big(N(m_a, \Sigma_a), N(m_b, \Sigma_b)\big)   (20)
\qquad - \frac{D}{2} \log(2\pi e) - \frac{1}{2} \log\left( |\Sigma_b|\, |\Sigma_a - \Psi \Sigma_b^{-1} \Psi^\top| \right)   (21)

For the case of two univariate Gaussians, i.e., when the joint distribution has the form

Q(Z_a, Z_b) = N\!\left( \begin{pmatrix} m_a \\ m_b \end{pmatrix}, \begin{pmatrix} \Sigma_a & \psi \\ \psi & \Sigma_b \end{pmatrix} \right)

the solution is given by the solution of the scalar quadratic equation

f(\psi) = \psi^2 + \frac{1}{\gamma}\, \psi - \Sigma_a \Sigma_b = 0   (22)

\psi = -\frac{1}{2\gamma} \pm \frac{1}{2|\gamma|} \left( 1 + 4\gamma^2 \Sigma_a \Sigma_b \right)^{1/2}   (23)

We take the root that gives a feasible solution as the minimizer. In the scalar case, this is the solution that satisfies \Sigma_a - \psi^2/\Sigma_b \ge 0, or equivalently \psi^2 \le \Sigma_a \Sigma_b:

\psi = \frac{1}{2\gamma}\left( u_\gamma(\Sigma_a, \Sigma_b) - 1 \right)   (24)

where we have defined

u_\gamma(\Sigma_a, \Sigma_b) = \left( 1 + 4\gamma^2 \Sigma_b \Sigma_a \right)^{1/2}

It can be easily checked that the other root is infeasible. For the scalar ψ case we obtain

WD_{2,\gamma}\big(N(m_a, \Sigma_a), N(m_b, \Sigma_b)\big) = \frac{\gamma}{2}\left( \|m_a - m_b\|_2^2 + \Sigma_a + \Sigma_b \right) - \frac{1}{2}\left( u_\gamma(\Sigma_a, \Sigma_b) - 1 \right)
\qquad + \frac{1}{2} \log\left( u_\gamma(\Sigma_a, \Sigma_b) + 1 \right) - \frac{1}{2} \log\left( 2 \Sigma_b \Sigma_a \right) - \log(2\pi) - 1
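The closed forms above are easy to verify numerically. The sketch below implements the unregularized distance of Eq. (12) with a matrix square root and the scalar entropy regularized case via the feasible root of Eqs. (22)-(24); it is a numerical check under these assumptions, not part of the training code.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared(mu_a, cov_a, mu_b, cov_b):
    """Unregularized squared l2-Wasserstein distance between two Gaussians, Eq. (12)."""
    sqrt_b = sqrtm(cov_b)
    cross = sqrtm(sqrt_b @ cov_a @ sqrt_b)
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * np.real(cross)))

def w2_squared_entropic_scalar(m_a, s_a, m_b, s_b, gamma):
    """Entropy regularized squared distance of Eq. (19) in the scalar case,
    using the feasible root of the quadratic, Eq. (24)."""
    u = np.sqrt(1.0 + 4.0 * gamma ** 2 * s_a * s_b)
    psi = (u - 1.0) / (2.0 * gamma)
    return (m_a - m_b) ** 2 + s_a + s_b - 2.0 * psi

# As gamma grows, psi -> sqrt(s_a * s_b) and the regularized distance approaches
# the unregularized scalar W2^2 = (m_a - m_b)^2 + (sqrt(s_a) - sqrt(s_b))^2.
print(w2_squared_entropic_scalar(0.0, 1.0, 1.0, 4.0, gamma=1e6))   # ~ 2.0
print(w2_squared(np.array([0.0]), np.array([[1.0]]),
                 np.array([1.0]), np.array([[4.0]])))              # 2.0
```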
Similarly, xi denotes the i th component of the observation vector x of size Dx. For optimization, we need an unbiased estimate of the gradient of the SE-ELBO with respect to encoder parameters η and decoder parameters θ: BSE(η, θ) = E {log p(X = x|Za, θ)}q(Za|xa,η) + E {log p(Za)}q(Za|xa,η) + E {log p(Zb)}q(Zb|xb,η) 2 E {c(Za, Zb)}q(Za,Zb|xa,xb,η) + H(q(Za, Zb|xa, xb, η)) Given x, we first select a fictive sample x via a selection mechanism, in this case as an adversarial attack as explained in section 3.1. Sample a latent representation and calculate the associated prediction za q(Za|Xa = x, η) = N(Za; µa, Σa) x = g(za; η) The terms of the SE-ELBO can be calculated as E {log p(x|Za, θ)}q(Za|Xa=x,η) Dx 2 log 2πv 1 i=1 (xi xi)2 E {log p(Za)}q(Za|Xa=x,η) = 1 k=1 ((µk a)2 + Σk a) Dz E {log p(Zb)}q(Zb|Xb=x ,η) = 1 k=1 ((µk b)2 + Σk b) Dz WD2,γ = γ 2 E Za Zb 2 q(Za,Zb|xa,xb,η) H(q(Za, Zb|xa, xb, η)) uγ(Σa, Σb) = p 1 + 4γ2ΣbΣa WD2,γ = γ 2 µk a µk a 2 2 + Σk a + Σk b (uγ(Σk a, Σk b) 1) log(uγ(Σk a, Σk b) + 1) + log(2Σk bΣk a) Dz log(2π) Dz C.1 EXPERIMENTAL DETAILS AND FURTHER RESULTS We always train decoder-encoder pairs with identical architectures using both the standard VAE ELBO and the SE ELBO with a fixed γ. Then, in each case by fixing the encoder (that is essentially using the same representation) and by only using the mean activations of the encoders, we train linear classifiers using standard training for solving several downstream tasks. For both encoder and decoder networks we use a 4 layer multi layer perceptron (MLP) and a convolutional network (Conv NET) architectures with 200 units of Re LU activations at each layer. We carried out experiments with latent space dimensions of 32, 64 and 128, corresponding to an output sizes of an encoder with 64, 128 and 256 units, with two units per dimensions to encode the mean and the log-variance parameters of a fully factorized Gaussian condition distribution. The training is done using the Adam optimizer. Each network (both the encoder and decoder) are randomly initialized and trained for 300K iterations. Published as a conference paper at ICLR 2020 MNIST Digit Color-MNIST Color, Digit Celeb A see table 1 Training Representation Dimension, dim(Z) 32, 64, 128, 256 VAE or SE Observation noise variance, v 0.25, 0.5, 1., 2. Architecture MLP, Conv NET Training Coupling Strength γ 0.01, 0.1, 1, 5, 10, 50 SE Only Selection PGD Radius ϵ 0.01, 0.1, 0.2, 0.3 Selection PGD Iteration Budget L 1, 5, 10, 20, 50 Evaluation Attack PGD Radius to downstream task 0.01, 0.1, 0.2, 0.3 Attack PGD Iteration Budget 100 (10 random restarts) Table 2: Experiment Hyperparameters 0.01 0.10 0.20 0.30 Attack Radius Adversarial Accuracy (%) Color Mnist (Arch: Conv, Task:Digit, Sel. Iter. Budget:20) VAE SE Selection: PGD SE Selection: Random VAE Nominal SE Nominal 0.01 0.10 0.20 0.30 Attack Epsilon Adversarial Accuracy (%) Color Mnist (Arch: Conv, Task:Color, Sel. Eps.: 0.2) SE 1 SE 5 SE 10 SE 20 Figure 7: Simulation Results on Color MNIST. (a) The goal is comparing the robustness as a result of the selection procedures, uniformly random selection from the unit ball and adversarial selection. (b) The effect of the selection budget L on the adversarial accuracy. The plot shows adversarial accuracy results for L = 1, 5, 10, 20. 0 2 4 6 8 10 Percentage of data used for training Adversarial Accuracy (%) Celeb A (Arch: Conv, Task:Mustache, Sel. Iter. Budget:20, Sel. Eps.:0.1) Adv. 
(a) Mustache  (b) Bald

Figure 8: Simulation results illustrating label efficiency on CelebA (Arch: Conv, selection iteration budget 20, selection ε = 0.1); the horizontal axis is the percentage of data used for training the downstream classifier and the vertical axis is the adversarial accuracy. The dotted line shows the downstream task adversarial accuracy obtained by using all the labelled data. Once the representation is fixed, it is feasible to achieve the same accuracy with a small fraction of the data.

D RELATED WORK

D.1 GENERATIVE ADVERSARIAL NETWORKS (GANS)

GANs are presented as neural sampling models of observations x of the form x = f(ζ; η), where f is typically a deep neural network with parameters η, and ζ is a realization from some simple distribution p(Z). In the context of GANs, the function f is called a generator. When the dimension of x is bigger than the dimension of ζ, the density p(x) induced by the transformation f is inevitably a degenerate distribution. Since f is continuous, it is concentrated on a subset of the data space X_f \equiv \{x : \exists \zeta,\, x = f(\zeta; \eta)\}. Our use of the letter f and parameters η is deliberate, and we will illustrate in the sequel that the generator network of a GAN is actually analogous to a smooth encoder, where the roles of the latent variables and observations are switched; but we will first review GANs.

To fit a degenerate distribution to a dataset, the GAN approach adopts a strategy where the generator is co-trained with a second neural network d(x; w) with parameters w, with the following objective

\min_\theta \max_w \left\{ \mathbb{E}_{D_{real}(x)}\{\log d(x; w)\} + \mathbb{E}_{p(\zeta)}\{\log(1 - d(g(\zeta; \theta); w))\} \right\}

where D_{real}(x) is the empirical data distribution. This objective is (unfortunately) referred to as an adversarial objective in the literature, not to be confused with the adversarial attack mechanism in the context of supervised learning (Madry et al., 2017). The function d is called a discriminator. After replacing expectations with finite sample averages, this objective enforces that, in a dataset that contains both synthetically generated (fake) and real examples, the classification function d should increase the correct classification rate by discriminating fakes from real examples, while the generator f should decrease the detection rate of fake examples. When 0 \le d(\cdot) \le 1, which is the case for a classifier, one can also write the objective as

\min_\theta \max_w \left\{ \mathbb{E}_{D_{real}}\{l(x; w)\} - \mathbb{E}_{p(\zeta)}\{l(f(\zeta; \eta); w)\} \right\}

where l(x; w) = \log d(x; w). This form also highlights an interesting alternative formulation and an interpretation in terms of optimal transport. In fact, not long after the seminal work of Goodfellow et al. (2014), the mechanism behind the GAN objective and its direct connection to the theory of optimal transport was recognized by the seminal paper of Arjovsky et al. (2017), where the problem is further framed as

\min_\theta \max_w \left\{ \mathbb{E}_{D_{real}(x)}\{l(x; w)\} - \mathbb{E}_{D_{fake}(\tilde{x};\theta)}\{l(\tilde{x}; w)\} \right\}

with the constraint that |l(x; w) - l(\tilde{x}; w)| \le c(x, \tilde{x}), i.e., l is a Lipschitz function, for some L where c(x, \tilde{x}) \le L \|x - \tilde{x}\|. Here, D_{fake}(\tilde{x}; \theta) is the fitted density of \tilde{x} = f(\zeta; \eta). This is the dual formulation of the optimal transport problem, which can be understood as an economic transaction between a customer and a shipment company.
Here, the difference l(x; w) - l(\tilde{x}; w) can be interpreted as the profit made by the company for the shipment of one unit of mass from \tilde{x} to x, and the Lipschitz condition ensures that it still makes sense for the customer to use the services of the company rather than simply doing the transport on her own (Solomon, 2018). The customer wants to pay less, so she should minimize the profit of the company. This can be achieved by changing the desired delivery distribution D_{fake} by adjusting θ, so that the transfer from the fixed source distribution D_{real} is minimized. Ideally, when D_{fake} = D_{real}, there is nothing to transfer and no cost is incurred. This objective also minimizes the Wasserstein distance between the actual data distribution D_{real} and the fake data distribution D_{fake} as given by the generator.

Once the GAN objective can be viewed as minimizing a particular Wasserstein distance, it is rather straightforward to view it as a maximizer of a particular ELBO corresponding to a particular smooth encoder, albeit one in which the positions of the observations and the latents are exchanged and a very large coupling coefficient γ is chosen. Moreover, the variational marginals have specific forms: one marginal Q_a(X) is chosen as the empirical data distribution and the other marginal is chosen as having the form of a neural sampler Q_b(X_b) = \int q(X_b \mid Z_b, \eta)\, p(Z_b)\, dZ_b. The artificial extended target becomes

p(Z, Z' \mid X, \theta) \propto \int dX_b\, p(Z \mid X, \theta)\, p(Z' \mid X_b, \theta)\, \psi(X, X_b)   (28)

It can be seen that the ELBO in this case becomes

\log p(Z, Z' \mid X, \theta) \ge \mathbb{E}_{Q_a(X)}\{\log p(Z \mid X, \theta)\} + \mathbb{E}_{Q_a(X)}\{\log p(X)\}
\qquad + \mathbb{E}_{Q_b(X_b)}\{\log p(Z' \mid X_b, \theta)\} + \mathbb{E}_{Q_b(X_b)}\{\log p(X_b)\}
\qquad - \frac{\gamma}{2}\, \mathbb{E}_{Q(X_a, X_b)}\{c(X_a, X_b)\} + H(Q(X_a, X_b))   (29)

Now, by taking the coupling γ sufficiently large, the coupling term dominates the lower bound and we obtain the Wasserstein minimization objective. The random draws from p(Z) become the selection mechanism. Moreover, the terms that depend on the artificial target p(Z|X, θ) also become irrelevant, so in this regime the problem becomes just solving the optimal transport problem between Q_a and Q_b. A link between entropic GANs and VAEs has also been pointed out in the literature, albeit for calculating a likelihood for GANs (Balaji et al., 2018). However, our motivation as well as the interpretation of the connection is quite different, and we view the GAN decoder as an instance of the smooth encoder.

D.2 DISENTANGLED REPRESENTATIONS AND β-VAE

Targeting the encoder to an augmented distribution different from that of the decoder gives us the freedom to express some extensions of the VAE in the same framework. One such extension is the β-VAE, quite often used for controlling representations, which replaces the original variational objective (1) with the following objective

\log p(X = x \mid \theta) \ge \mathbb{E}_{q(Z|X_a=x,\eta)}\{\log p(X = x \mid Z, \theta)\} - \beta\, D_{KL}\big(q(Z \mid X_a = x, \eta) \,\|\, p(Z)\big)   (30)

The justification in the original paper (Higgins et al., 2017) is obtained from an implicit robustness criterion where D_{KL}(q(Z|X_a = x, \eta) \| p(Z)) < \epsilon and β appears in a Lagrangian formulation. Hoffman & Johnson (2016) have also provided an alternative justification. In our formulation, β can simply be interpreted as a dispersion term that is related to the number of points selected by the selection mechanism. To see this, suppose the selection mechanism chooses β - 1 points x_{b,i}, i = 1 ... β - 1, that are identical to the true observation, x = x_{b,i} = x'_i for i = 1 ... β - 1.
In our formulation, β can simply be interpreted as a dispersion term that is related to the number of points selected by the selection mechanism. To see this, suppose the selection mechanism chooses β − 1 points $x_{b,i}$, $i = 1, \dots, \beta-1$, that are all identical to the true observation, i.e., $x_{b,i} = x$ for $i = 1, \dots, \beta-1$. The marginal likelihood of the extended model is then

$$p(X|\theta) = \int dX'_{1:\beta-1}\; p(X, X'_{1:\beta-1}|\theta) \tag{31}$$
$$= \int dX'_{1:\beta-1}\, dZ\, dZ'_{1:\beta-1}\; p(X|Z, \theta)\, p(Z) \prod_{i=1}^{\beta-1} p(X'_i|Z'_i, \theta)\, p(Z'_i) \tag{32}$$
$$= \int dZ\, dZ'_{1:\beta-1}\; p(X|Z, \theta)\, p(Z) \prod_{i=1}^{\beta-1} p(Z'_i) \tag{33}$$

Now, instead of integrating out $Z'_{1:\beta-1}$, we choose a variational distribution with identical marginals of the form

$$Q(Z, Z'_{1:\beta-1}) = q(Z|X_a = x, \eta) \prod_{i=1}^{\beta-1} q(Z'_i|X_{b,i} = x, \eta) \tag{34}$$

The variational lower bound then becomes identical to the β-VAE objective:

$$\log p(X|\theta) \ge \mathbb{E}_{q(Z|X_a=x,\eta)}\{\log p(X|Z, \theta)\} + \mathbb{E}_{q(Z|X_a=x,\eta)}\{\log p(Z)\} \tag{35}$$
$$\qquad + \sum_{i=1}^{\beta-1} \mathbb{E}_{q(Z'_i|X_{b,i}=x,\eta)}\{\log p(Z'_i)\} + H\big(Q(Z, Z'_{1:\beta-1})\big) \tag{36}$$
$$= \mathbb{E}_{q(Z|X_a=x,\eta)}\{\log p(X = x|Z, \theta)\} - \beta\, D_{\mathrm{KL}}\big(q(Z|X_a = x, \eta)\,\|\, p(Z)\big) \tag{37}$$

where the last step follows from the functional form of the variational distribution: all β marginals are identical, so the prior and entropy terms combine into β copies of $-D_{\mathrm{KL}}(q(Z|X_a = x, \eta)\,\|\, p(Z))$.

E TECHNICAL RESULTS

E.1 BATCH ELBO

In Section 2.2, we defined a batch ELBO (2). To see the connection to the VAE ELBO (1),

$$\log p(X = x|\theta) \ge \mathbb{E}_{q(Z|X=x,\eta)}\{\log p(X = x|Z, \theta)\} - D_{\mathrm{KL}}\big(q(Z|X = x, \eta)\,\|\, p(Z)\big) \equiv \mathcal{B}_x(\eta, \theta)$$

we first define the empirical data distribution $\pi(X) = \frac{1}{N}\sum_{i=1}^{N} \delta(X - x_i)$. We can now write

$$\frac{1}{N}\sum_{i=1}^{N} \log p(X = x_i|\theta) \ge \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{q(Z|X=x_i,\eta)}\{\log p(X = x_i|Z, \theta)\} - \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{q(Z|X=x_i,\eta)}\{\log q(Z|X = x_i, \eta)\} + \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{q(Z|X=x_i,\eta)}\{\log p(Z)\}$$
$$= \mathbb{E}_{q(Z|X,\eta)\pi(X)}\{\log p(X|Z, \theta)\} - \mathbb{E}_{q(Z|X,\eta)\pi(X)}\{\log q(Z|X, \eta)\} + \mathbb{E}_{q(Z|X,\eta)\pi(X)}\{\log p(Z)\} - \mathbb{E}_{q(Z|X,\eta)\pi(X)}\{\log \pi(X)\} + \mathbb{E}_{\pi(X)}\{\log \pi(X)\}$$
$$= -D_{\mathrm{KL}}\big(q(Z|X, \eta)\pi(X)\,\|\, p(X|Z, \theta)p(Z)\big) + \text{const}$$

This result shows that the ELBO is minimizing the KL distance between one exact and one approximate factorization of the joint distribution, $p(X, Z) = p(X|Z, \theta)p(Z) \approx q(Z|X, \eta)\pi(X)$.

E.2 WHY IS THE DECODER TYPICALLY SMOOTH AFTER VAE TRAINING?

In the context of a VAE, the smoothness of the decoder is implicitly enforced by the highly constrained encoder distribution and the dynamics of SGD-based training. In the sequel, we illustrate that, if two latent coordinates are sufficiently close, the change in the decoder mean mapping is forced to be bounded.

In a standard VAE, the encoder output for each data point is conditionally Gaussian, $q(Z|X = x; \eta) = \mathcal{N}(f_\mu(x; \eta), f_\Sigma(x; \eta))$, and the decoder is chosen as $p(X|Z = z; \theta) = \mathcal{N}(g(z; \theta), v I)$. Under the ELBO, the decoder parameters θ depend only on the data fidelity term $\|x - g(z; \theta)\|^2 / v$. For a moment, assume that the encoder is fixed and focus on a single data point x. During training, a set of latent state vectors $z_i$, $i = 1, \dots, T$, is sampled from the conditionally Gaussian encoder distribution. When the dimension $D_z$ of the latent space is large, these samples $z_i$ will lie with high probability in the typical set. The typical set of a nondegenerate Gaussian distribution is approximately the surface of a Mahalanobis ball, a compact hyper-ellipsoid $\mathcal{M}(x)$ centered at $f_\mu(x; \eta)$ with scaling matrix $f_\Sigma(x; \eta)^{1/2}$. If we assume that the training procedure is able to reduce the error in the sense that $\|x - g(z_i; \theta)\| \le E$ for all $z_i$, where E is a bound on the error magnitude for $z_i$ sampled from the encoder, the decoder is forced to give approximately the same output for each point on $\mathcal{M}(x)$.

For a point $z_a$ drawn from $q(Z|X = x; \eta)$ we have $\|z_a - f_\mu(x; \eta)\|_K \approx \sqrt{D_z}$ with high probability, where $K = f_\Sigma(x; \eta)^{-1}$ and $\|u\|_K \equiv \sqrt{u^\top K u}$ denotes the Mahalanobis norm. For a point $z_b$ independently drawn from $q(Z|X = x; \eta)$, by the triangle inequality we have

$$\|g(z_a; \theta) - g(z_b; \theta)\| \le 2E \tag{38}$$

while the Mahalanobis distance satisfies $\sqrt{D_z} \lesssim \|z_a - z_b\|_K \le \lambda_{\min}^{-1/2}\, \|z_a - z_b\|$, where $\lambda_{\min}$ is the smallest eigenvalue of the covariance matrix. Hence the Euclidean distance is also bounded away from zero when the variance is not degenerate: the minimum distance will be on the order of $\|z_a - z_b\| \sim \sqrt{D_z \lambda_{\min}}$, so we expect the ratio to be bounded,

$$\|g(z_a; \theta) - g(z_b; \theta)\| \,/\, \|z_a - z_b\| \lesssim E / \sqrt{D_z \lambda_{\min}} \tag{39}$$

We see that the ELBO objective enforces the decoder to be invariant on the typical set of $q(Z|X = x; \eta)$, where most of the probability mass is concentrated. Now, for each data point x, the corresponding latent space hyper-ellipsoid $\mathcal{M}(x)$ is forced to be large, in the sense of having a large determinant, by the entropy term of the encoder, which promotes a large log-determinant. The size of $\mathcal{M}(x)$ is also controlled by the prior fidelity term, which prevents it from blowing up. Hence the union $\bigcup_{x \in \mathcal{X}} \mathcal{M}(x)$, where $\mathcal{X}$ is the dataset, will approximately cover the latent space when the encoder has converged, and on each hyper-ellipsoid $\mathcal{M}(x)$ the decoder will be enforced to be smooth.
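The argument above suggests a simple empirical check: sample pairs of latent codes from the encoder posterior of a single input and measure the ratio in Eq. (39). The sketch below is illustrative rather than the paper's code; `encoder` and `decoder` are hypothetical trained modules with the interfaces assumed here.

```python
# Minimal Monte Carlo check of the decoder-smoothness ratio in Eq. (39):
# sample z_a, z_b ~ q(Z | X = x) for a fixed input x and compare the change in
# the decoder mean output to the distance between the latent samples.
import torch

def decoder_smoothness_ratio(x, encoder, decoder, num_pairs=256):
    with torch.no_grad():
        mu, log_var = encoder(x.unsqueeze(0))        # parameters of q(Z | X = x, eta)
        std = torch.exp(0.5 * log_var)
        # independent pairs (z_a, z_b) drawn from the same Gaussian posterior
        z_a = mu + std * torch.randn(num_pairs, mu.shape[-1])
        z_b = mu + std * torch.randn(num_pairs, mu.shape[-1])
        g_a, g_b = decoder(z_a), decoder(z_b)        # decoder mean mappings g(z; theta)
        num = (g_a - g_b).flatten(1).norm(dim=1)     # ||g(z_a) - g(z_b)||
        den = (z_a - z_b).norm(dim=1)                # ||z_a - z_b||
        return (num / den).mean()                    # estimate of the ratio in Eq. (39)
```

A small average ratio across many inputs is consistent with the decoder being effectively smooth on the typical set of the encoder posterior.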
E.3 SMOOTHNESS OF THE SMOOTH ENCODER

In this section we show that smooth encoder training forces a small Lipschitz constant for the encoder mean mapping. To simplify the argument, we assume that the variance mapping of the encoder is a constant function that does not vary with x, i.e., $f_\Sigma(x; \eta) = \Sigma(\eta)$. This assumption could be removed by considering a metric on the joint space of the means and covariances. Using the adversarial selection mechanism, during training we solve the following problem using PGD:

$$x^* = \arg\max_{x' :\, \|x - x'\|_p \le \epsilon} \mathrm{WD}\big(q(Z|X = x, \eta),\, q(Z|X = x', \eta)\big)$$

Assuming that PGD finds the global maximum at the boundary of the ϵ-ball, where $\|x - x^*\|_p = \epsilon$, under the constant-variance assumption for the encoder the Wasserstein divergence simply becomes the squared distance between the mean mappings:

$$\mathrm{WD}\big(q(Z|X = x, \eta),\, q(Z|X = x^*, \eta)\big) = \|f_\mu(x; \eta) - f_\mu(x^*; \eta)\|_2^2$$

We know that the SE ELBO objective has to minimize this distance for any coupling term γ, so the procedure actually reduces the local Lipschitz constant L(x) around each data point x,

$$L(x) = \frac{\|f_\mu(x; \eta) - f_\mu(x^*; \eta)\|}{\epsilon} \le \frac{E}{\epsilon},$$

and thereby promotes smoothness, where E is an upper bound on the change in the representation, $\|f_\mu(x; \eta) - f_\mu(x^*; \eta)\| \le E$.
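For concreteness, the following is a minimal sketch, not the paper's implementation, of the PGD-based adversarial selection step above for an $\ell_\infty$ ball, under the constant-variance assumption so that the Wasserstein divergence reduces to the squared distance between encoder means. `encoder_mean` is a hypothetical module computing $f_\mu(x; \eta)$; the defaults mirror the selection budget (20 iterations) and ϵ = 0.1 reported in the Figure 8 caption.

```python
# Minimal PGD sketch for the adversarial selection mechanism: find a perturbed
# input x' within an l_inf epsilon-ball that maximizes || f_mu(x) - f_mu(x') ||^2.
import torch

def adversarial_selection(x, encoder_mean, eps=0.1, step_size=0.02, num_steps=20):
    mu_clean = encoder_mean(x).detach()                 # f_mu(x; eta), treated as fixed
    delta = torch.zeros_like(x, requires_grad=True)     # perturbation, initialized at zero
    for _ in range(num_steps):
        # squared distance between mean mappings (WD under constant variance)
        wd = ((encoder_mean(x + delta) - mu_clean) ** 2).sum()
        grad, = torch.autograd.grad(wd, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()            # ascent step in l_inf geometry
            delta.clamp_(-eps, eps)                     # project back onto the epsilon-ball
            # (clipping x + delta to the valid input range is omitted for brevity)
    return (x + delta).detach()                         # selected fictive point x'
```

During smooth encoder training, the selected point x' then enters the coupling term of the objective, which penalizes the distance between q(Z|X = x) and q(Z|X = x') and thus drives down L(x).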