# Parameterized Rate-Distortion Stochastic Encoder

Quan Hoang, Trung Le, Dinh Phung (Department of DSAI, Faculty of Information Technology, Monash University, Australia. Correspondence to: Quan Hoang.)

Abstract. We propose a novel gradient-based tractable approach for the Blahut-Arimoto (BA) algorithm to compute the rate-distortion function where the BA algorithm is fully parameterized. This results in a rich and flexible framework to learn a new class of stochastic encoders, termed PArameterized RAte-DIstortion Stochastic Encoder (PARADISE). The framework can be applied to a wide range of settings, from semi-supervised and multi-task to supervised and robust learning. We show that the training objective of PARADISE can be seen as a form of regularization that helps improve generalization. With an emphasis on robust learning, we further develop a novel posterior matching objective to encourage smoothness of the loss function and show that PARADISE can significantly improve interpretability as well as robustness to adversarial attacks on the CIFAR-10 and ImageNet datasets. In particular, on the CIFAR-10 dataset, our model reduces standard and adversarial error rates in comparison to the state-of-the-art by 50% and 41%, respectively, without the expensive computational cost of adversarial training.

1. Introduction

The main objective of representation learning is to learn good representations that can be used for downstream tasks. From this standpoint, rate-distortion theory offers an attractive approach for representation learning where the goodness of a representation can be measured by an appropriately defined distortion function. Rate-distortion theory is, however, often applied in machine learning in the form of the Information Bottleneck (IB) method (Tishby et al., 2000), which measures the goodness of a representation by its mutual information with a relevance variable. Given the input random variable X and a relevance random variable Y, the objective is to compress X into a representation Z that captures as much information about Y as possible while retaining as little information about X as possible. Mathematically, the IB objective finds the optimal encoding q(z | x) by minimizing the functional:

$$\mathcal{L}[q(z \mid x)] = I(Z; X) - \beta I(Z; Y) \tag{1}$$

where β is the Lagrange multiplier. Unfortunately, the iterative algorithm proposed in (Tishby et al., 2000) to optimize the IB objective is intractable and therefore infeasible to apply in practice. Recently, Alemi et al. (2016) proposed Deep Variational Information Bottleneck (DVIB), a variational approximation to the IB objective using variational lower and upper bounds on mutual information, and claimed that DVIB improves robustness to adversarial attacks. However, DVIB is just an instance of gradient obfuscation (Athalye et al., 2018), as later shown in Sec. 3.3, and thus gives a false sense of robustness. Furthermore, by defining the goodness of a representation as the mutual information with another variable, the Information Bottleneck method significantly limits the flexibility and potential applications of rate-distortion theory. In this paper, we revisit the Blahut-Arimoto algorithm (Blahut, 1972; Arimoto, 1972) and make a simple modification that makes it feasible to numerically compute the rate-distortion function with gradient-based optimization.
The result is an elegant and flexible framework for representation learning that can be applied to a wide range of settings, from unsupervised, semi-supervised, and multi-task to supervised and robust learning. The key component of our framework is a parameterized stochastic encoder that we term PArameterized RAte-DIstortion Stochastic Encoder, or PARADISE. We investigate the behavior of the algorithm on the MNIST (LeCun et al., 1998) and CelebA (Liu et al., 2015) datasets. For supervised learning, we demonstrate that the derived objective can be seen as a form of regularization that helps improve generalization. For robust learning, we show that introducing inductive bias into the learning of PARADISE can significantly improve interpretability as well as robustness to adversarial attacks on the CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015) datasets. In particular, on the CIFAR-10 dataset, our model reduces standard and adversarial error rates in comparison to the state-of-the-art (Qin et al., 2019) by 50% and 41%, respectively, without the expensive computational cost of adversarial training.

In short, our main contributions are: (i) a novel gradient-based tractable approach for the Blahut-Arimoto (BA) algorithm to compute the rate-distortion function where the BA algorithm is fully parameterized; (ii) a new class of stochastic encoders, termed PArameterized RAte-DIstortion Stochastic Encoder (PARADISE), that can be applied to a wide range of settings from semi-supervised and multi-task to supervised and robust learning; (iii) a novel posterior matching objective for robust learning; (iv) a comprehensive evaluation of PARADISE for supervised and robust learning; and (v) a new state-of-the-art result for adversarial accuracy against untargeted white-box attacks on CIFAR-10, reducing state-of-the-art adversarial errors by 41%.

2. Theoretical Framework

Rate distortion. We start with some key results in rate-distortion theory used in this work.¹ The theory was developed by Shannon (1948) in the context of transmitting information over noisy channels. Given an input (message) sequence $x^n = (x_1, x_2, \ldots, x_n)$ where each $x_i$ is drawn i.i.d. from a source distribution $p(x)$, $x \in \mathcal{X}$, a communication channel receives the input sequence and outputs a sequence $z^n = (z_1, z_2, \ldots, z_n)$, $z_i \in \mathcal{Z}$. The sequence $z^n$ is commonly referred to as the codeword, reconstruction, or representation. In this work, we call $z^n$ the representation sequence and each element $z_i$ a representation to avoid confusion.

To measure the quality of the output representation, a distortion measure (or function) is defined as a mapping from the set of input-representation pairs to the set of non-negative real numbers:

$$d : \mathcal{X} \times \mathcal{Z} \rightarrow \mathbb{R}^{+}$$

Given the distortion measure $d(x, z)$, the distortion between sequences $x^n$ and $z^n$ is:

$$d(x^n, z^n) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, z_i)$$

The rate-distortion function $R(D)$ maps the distortion budget $D$ to the minimum rate of the communication channel for which the expected distortion stays within the budget as $n$ goes to infinity. The rate can be interpreted as the average number of bits per representation required to specify each representation sequence without confusion.
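For intuition, a standard textbook instance of the rate-distortion function (Cover & Thomas, 2012, Ch. 10) is the Bernoulli(p) source under Hamming distortion; we include its closed form here only as a reference point, since the paper itself does not derive it:

```latex
% Rate-distortion function of a Bernoulli(p) source with Hamming distortion
% (standard textbook result; included only as a worked reference point).
R(D) =
\begin{cases}
  H_b(p) - H_b(D), & 0 \le D \le \min(p,\, 1-p) \\
  0,               & D > \min(p,\, 1-p)
\end{cases}
\qquad
H_b(t) = -t \log_2 t - (1 - t)\log_2(1 - t)
```

For example, a fair-coin source (p = 0.5) with distortion budget D = 0.1 requires R(0.1) = 1 − H_b(0.1) ≈ 0.53 bits per symbol.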
An important result in rate-distortion theory is that, for a bounded distortion function $d(x, z)$, the rate-distortion function is equal to the information rate-distortion function $R^{(I)}(D)$²:

$$R^{(I)}(D) = \min_{p(z \mid x)\,:\; \sum_{x,z} p(x)\, p(z \mid x)\, d(x,z) \,\le\, D} I(X; Z) \tag{2}$$

where the minimization is over all conditional distributions $p(z \mid x)$ for which the expected distortion over the joint distribution $p(x, z) = p(x)\, p(z \mid x)$ stays within the distortion budget $D$. For the rest of the paper, we use $R(D)$ to denote both the canonical and the information rate-distortion function.

¹Interested readers are encouraged to find more details in (Cover & Thomas, 2012), Ch. 10.

²For the sake of simplicity, we slightly abuse the symbol $\sum$ to denote integration for both the continuous and discrete cases.

There is a nice intuition as to why the average number of bits per transmission required is the mutual information $I(X; Z)$. The conditional distribution $p(z \mid x)$ induces a soft partitioning of $\mathcal{X}$. The average volume of the elements of $\mathcal{X}$ mapped to the same representation is $2^{H(X \mid Z)}$, where $H(X \mid Z)$ is the conditional entropy of $X$ given $Z$. Since the volume of $\mathcal{X}$ is $2^{H(X)}$, the average cardinality of the partitioning of $\mathcal{X}$ is $2^{H(X)} / 2^{H(X \mid Z)} = 2^{I(X; Z)}$.

To compute the rate-distortion function, the Blahut-Arimoto algorithm (Blahut, 1972; Arimoto, 1972) makes use of the following lemma:

Lemma 1. Let $p(x, z) = p(x)\, p(z \mid x)$ be a given joint distribution. The distribution $q(z)$ that minimizes the Kullback-Leibler (KL) divergence $\mathrm{KL}[p(x, z) \,\|\, p(x)\, q(z)]$ is the marginal distribution $p(z)$ corresponding to $p(z \mid x)$:

$$q^{*}(z) = \operatorname*{argmin}_{q(z)} \mathrm{KL}[p(x, z) \,\|\, p(x)\, q(z)] = p(z)$$

where $p(z) = \sum_x p(x)\, p(z \mid x)$.

Proof. See Sec. 1 of the supplementary material.

Lemma 1 turns the problem of computing the rate-distortion function in Eq. (2) into a double minimization problem:

$$R(D) = \min_{q(z)} \; \min_{p(z \mid x)\,:\, \mathbb{E}\, d(x,z) \le D} \; \sum_{x,z} p(x, z) \log \frac{p(x, z)}{p(x)\, q(z)}$$

By introducing the Lagrange multiplier $\beta$ for the distortion constraint, computing the rate-distortion function becomes minimizing the following functional:

$$F[p(z \mid x), q(z)] = \sum_{x,z} p(x)\, p(z \mid x) \log \frac{p(x)\, p(z \mid x)}{p(x)\, q(z)} + \beta \sum_{x,z} p(x)\, p(z \mid x)\, d(x, z) \tag{3}$$

The Blahut-Arimoto algorithm applies alternating minimization: for a fixed conditional distribution $p_t(z \mid x)$, the optimal distribution $q_t(z)$ is, by Lemma 1, the marginal distribution $p_t(z) = \sum_x p(x)\, p_t(z \mid x)$; for a fixed $q_t(z)$, the optimal distribution $p_{t+1}(z \mid x)$ can be found analytically as³:

$$p_{t+1}(z \mid x) = \frac{q_t(z) \exp(-\beta\, d(x, z))}{\sum_z q_t(z) \exp(-\beta\, d(x, z))} \tag{4}$$

³For self-completeness, see Lemma 2 and its proof in the supplementary material.

The algorithm produces a non-increasing sequence $F(p^0, q^0) \ge F(p^1, q^0) \ge F(p^1, q^1) \ge \ldots$ that is strictly decreasing unless it reaches the limit point

$$p(z \mid x) = \frac{p(z) \exp(-\beta\, d(x, z))}{\sum_z p(z) \exp(-\beta\, d(x, z))}$$

Parameterized Rate Distortion. Theorem 6 in (Blahut, 1972) shows that this limit point satisfies the necessary and sufficient conditions to achieve the equality in Eq. (2). However, computing the analytical solution in Eq. (4) is intractable in practice. Therefore, we introduce a relaxation of the Blahut-Arimoto algorithm: rather than trying to find the optimal, but intractable, $p_{t+1}(z \mid x)$ in Eq. (4), we seek the next tractable $p_{t+1}(z \mid x)$ such that $F(p_{t+1}, q_t) \le F(p_t, q_t)$ by taking a gradient step with respect to $p(z \mid x)$.
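For reference, the exact alternating updates of Lemma 1 and Eq. (4) can be run directly when the source and representation alphabets are small and finite. Below is a minimal NumPy sketch of the classic tabular iteration; the function name, initialization, and defaults are ours, not the paper's.

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iters=500):
    """Tabular Blahut-Arimoto: alternately set q(z) by Lemma 1 and p(z|x) by Eq. (4).
    p_x: (X,) source distribution; d: (X, Z) distortion matrix; beta: Lagrange multiplier."""
    X, Z = d.shape
    p_z_given_x = np.full((X, Z), 1.0 / Z)              # uniform initialization
    for _ in range(n_iters):
        q_z = p_x @ p_z_given_x                          # q_t(z) = sum_x p(x) p_t(z|x)
        w = q_z[None, :] * np.exp(-beta * d)             # unnormalized update of Eq. (4)
        p_z_given_x = w / w.sum(axis=1, keepdims=True)
    q_z = p_x @ p_z_given_x
    rate = np.sum(p_x[:, None] * p_z_given_x * (np.log2(p_z_given_x) - np.log2(q_z[None, :])))
    distortion = np.sum(p_x[:, None] * p_z_given_x * d)
    return rate, distortion                              # one (R, D) point; sweep beta for the curve

# Example: Bernoulli(0.3) source with Hamming distortion.
p_x = np.array([0.7, 0.3])
d = 1.0 - np.eye(2)
print(blahut_arimoto(p_x, d, beta=3.0))
```

Sweeping beta traces out the rate-distortion curve; for the binary-Hamming case the result can be checked against the closed form given above.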
With a suitable choice of learning rate, this gradient-based relaxation still yields a bounded, non-increasing sequence $F(p, q)$ and converges to the optimal solution. To realize this solution, we rewrite the functional objective $F$ over $p(z \mid x)$ and $q(z)$ in Eq. (3) in terms of a single parameter $\theta$, which parameterizes the conditional distribution $p_\theta(z \mid x)$ and attains the optimal $q^{*}_{\theta}(z) = \sum_x p_\theta(z \mid x)\, p(x)$. This results in a new parameterized objective w.r.t. $\theta$:

$$\mathcal{L}(\theta) = \sum_{x,z} p(x)\, p_\theta(z \mid x)\, d(x, z) + \alpha \sum_{x,z} p(x)\, p_\theta(z \mid x) \log \frac{p(x)\, p_\theta(z \mid x)}{p(x)\, q^{*}_{\theta}(z)} \tag{5}$$

where $q^{*}_{\theta}(z) = \sum_x p(x)\, p_\theta(z \mid x)$ is the optimal aggregate posterior solution. We can now learn $\theta$ as follows.⁴ At time $t$, the current solution $\theta_t$ parameterizes the conditional distribution $p_t(z \mid x)$, and we take one gradient step of $\mathcal{L}(\theta)$ to obtain $p_{\theta_{t+1}}(z \mid x)$ while $q_t(z) = \mathrm{StopGradient}\big[\sum_x p(x)\, p_{\theta_t}(z \mid x)\big]$ is held fixed. Here, we use the notation $\mathrm{StopGradient}[\cdot]$ to emphasize that $q_t(z) \equiv p_t(z) = \sum_x p(x)\, p_{\theta_t}(z \mid x)$, and that we keep $q_t(z)$ constant when updating $p_{t+1}(z \mid x)$, in a similar spirit to the BA algorithm. The second term in Eq. (5) represents the mutual information; in alignment with rate-distortion theory we name it the rate and denote it by $R(\theta)$. For convenience, the objective in Eq. (5) applies the weight $\alpha$ to the rate term instead of $\beta$ to the distortion term. Assuming the function family $p_\theta(z \mid x)$ is sufficiently rich, optimizing Eq. (5) can learn a parameterized encoder $p_\theta(z \mid x)$ that well approximates the rate-distortion function in Eq. (2). Therefore, we name it the PArameterized RAte-DIstortion Stochastic Encoder (PARADISE).

⁴Please see the algorithm pseudocode in the supplementary material for additional details.

In the context of deep learning, the distortion function is rarely fixed; it is often desirable to learn it. For example, $d(x, z)$ can be the mean squared error between $x$ and the reconstruction $\hat{x}$ of $x$ from $z$ using a decoder, or $d(x, z)$ can be the cross-entropy loss of a softmax classifier using $z$ to predict the label $y$ associated with $x$. If the distortion measure is parameterized as $d_\phi(x, z)$, the parameters $\theta$ and $\phi$ can be learned through joint optimization similar to the EM procedure. Further decomposing the rate $R(\theta)$ results in the following joint objective function:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{p_\theta(x,z)}\, d_\phi(x, z) - \alpha \sum_x p(x)\, H[p_\theta(z \mid x)] - \alpha \sum_x p(x) \sum_z p_\theta(z \mid x) \log q^{*}_{\theta}(z) \tag{6}$$

where $H(\cdot)$ is the differential entropy. We call the first term in Eq. (6) the expected distortion and denote it as $D(\theta, \phi)$. The objective $\mathcal{L}(\theta, \phi)$ minimizes the expected distortion while maximizing the entropy and encouraging the encoder $p_\theta(z \mid x)$ to map $x$ to representations $z$ of high score $\log q^{*}_{\theta}(z)$. The entropy term in $\mathcal{L}(\theta, \phi)$ can be computed analytically for certain parameterization choices of $p_\theta(z \mid x)$, while the score term is more challenging. If the search space of the marginal $q(z)$ is restricted to a prior distribution such as the standard Gaussian $\mathcal{N}(0, I)$, the rate $R(\theta)$ becomes $\sum_x p(x)\, \mathrm{KL}(p_\theta(z \mid x) \,\|\, q(z))$, which is exactly the regularization term used in β-VAE (Higgins et al., 2017) and DVIB (Alemi et al., 2016). Alternatively, we can avoid a fixed prior $q(z)$ by approximating the score $\log q^{*}_{\theta}(z)$ using Minibatch Weighted Sampling (MWS) proposed in (Chen et al., 2018) (see the supplementary material for details). For a batch size of $B$ and representation dimensionality $D$, the time complexity of approximating the score is $O(B^2 D)$. This cost is small compared to the total cost of the forward and backpropagation passes, especially when the network architecture is deep and wide.
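To ground Eqs. (5)-(6), the following PyTorch sketch assembles the pieces for the supervised case: a diagonal-Gaussian encoder with the reparameterization trick, a cross-entropy distortion, the analytic entropy term, an MWS estimate of the score $\log q^{*}_{\theta}(z)$, and `detach()` standing in for StopGradient on $q_t(z)$. Function names, the `dataset_size` argument, and the exact estimator details are our illustrative assumptions, not the paper's released code.

```python
import math
import torch
import torch.nn.functional as F

def rate_term(mu, log_sigma, z, dataset_size):
    """Minibatch estimate of R(theta) = -E_x H[p_theta(z|x)] - E_{p_theta(x,z)} log q*(z)."""
    B, D = mu.shape
    # Analytic differential entropy of the diagonal-Gaussian posterior, averaged over the batch.
    entropy = (0.5 * D * (1.0 + math.log(2 * math.pi)) + log_sigma.sum(dim=1)).mean()
    # Score log q*(z_i) via Minibatch Weighted Sampling (Chen et al., 2018), O(B^2 D);
    # q's parameters are detached, mimicking StopGradient[q_t(z)] in the text.
    q = torch.distributions.Normal(mu.detach().unsqueeze(0), log_sigma.exp().detach().unsqueeze(0))
    log_pz_x = q.log_prob(z.unsqueeze(1)).sum(dim=-1)               # [i, j] = log p(z_i | x_j)
    score = (torch.logsumexp(log_pz_x, dim=1) - math.log(dataset_size * B)).mean()
    return -entropy - score

def supervised_step(encoder, classifier, x, y, alpha, dataset_size):
    """One gradient step of L(theta, phi) = D(theta, phi) + alpha * R(theta); optimizer step omitted."""
    mu, log_sigma = encoder(x)
    z = mu + log_sigma.exp() * torch.randn_like(mu)                 # reparameterization trick
    distortion = F.cross_entropy(classifier(z), y)                  # d_phi(x, z) = -log p_phi(y | z)
    loss = distortion + alpha * rate_term(mu, log_sigma, z, dataset_size)
    loss.backward()
    return loss
```

For the unsupervised setting, the cross-entropy distortion would simply be replaced by a reconstruction error such as the squared error between $x$ and a decoder output $g_\phi(z)$.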
Intuitively, the encoder $p_\theta(z \mid x)$, when parameterized as a diagonal Gaussian, maps each $x$ to an ellipse. The entropy term in the objective $\mathcal{L}(\theta, \phi)$ (Eq. 6) encourages the ellipses to be as big as possible, while the score term pulls the ellipses together. Meanwhile, the encoder $p_\theta(z \mid x)$ must satisfy the distortion constraint. This can be achieved when points $x$ that are similar, as implied by the distortion measure, are mapped to close neighborhoods in the $z$-space.

Posterior matching for robust learning. In practical deployment, the encoder $p_\theta(z \mid x)$ often has to deal with out-of-distribution inputs. Recall that $p_\theta(z \mid x)$ induces a soft partitioning of $\mathcal{X}$, and rate minimization is intuitively equivalent to minimizing the cardinality of the partitioning. For some tasks, we may have prior knowledge of what an efficient partitioning should be. For example, one might expect that images that are pixel-wise close to each other should be mapped to similar representations, because humans are invariant to tiny changes in pixel values. Let $S(x)$ be a sampling process that draws data points considered similar to $x$ based on our prior knowledge. We can introduce inductive bias into the rate minimization procedure by adding a posterior matching (PM) term:

$$\mathrm{PM}(\theta, S) = \mathbb{E}_{p(x)}\, \mathbb{E}_{x', x'' \overset{\text{iid}}{\sim} S(x)}\, \mathcal{D}\big(p_\theta(z \mid x'),\, p_\theta(z \mid x'')\big)$$

where $\mathcal{D}(\cdot, \cdot)$ is a discrepancy measure between the posteriors. Overall, the robust learning objective for PARADISE with posterior matching (PARADISE-PM) is:

$$\mathcal{L}_{PM}(\theta, \phi) = D(\theta, \phi) + \alpha R(\theta) + \gamma\, \mathrm{PM}(\theta, S) \tag{7}$$

There are two main considerations when applying this framework. Firstly, an appropriate distortion measure needs to be chosen according to the downstream tasks. Rate-distortion theory requires the distortion function to be bounded, but we empirically found that this condition can be relaxed, especially when the distortion function is also jointly learned. Therefore, users have great flexibility in defining the distortion measure. Sec. 3.1 will demonstrate how different distortion measures result in different representations. An interesting scenario is an encoder trained with multiple distortion functions, which can be useful for multi-task or semi-supervised learning. Fully investigating this direction is, however, out of the scope of this paper and left for future work. Sec. 3 instead focuses on applications of the proposed framework to supervised and robust learning.

The second consideration is the inductive biases to be included in the learning. The sampling process $S(x)$ can be designed depending on the task and the prior knowledge. For example, when $x$ is an image, $S(x)$ can be defined using appropriate transformations such as cropping, rotation, or pixel perturbations. Sec. 3.3 will demonstrate that introducing such inductive bias can significantly improve the learned model's robustness to adversarial attacks.

3. Experimental Results

We demonstrate several aspects of our proposed model in Sec. 3.1. Then, we report results for the supervised learning setting in Sec. 3.2. Our main results on robust learning are presented in Sec. 3.3. Finally, for additional details and to encourage reproducibility, the supplementary material contains more extensive information on the experiments as well as additional results. We will also release our source code publicly.

3.1. Model Behavior

Here, we conduct experiments to investigate the behavior of the proposed algorithm.
We first describe how we apply PARADISE to the supervised and unsupervised settings in our experiments. Recall that the objective function of PARADISE is:

$$\mathcal{L}(\theta, \phi) = D(\theta, \phi) + \alpha R(\theta) \tag{8}$$

where $R(\theta)$ is the rate defined in Sec. 2 and $D(\theta, \phi) = \mathbb{E}_{p_\theta(x,z)}\, d_\phi(x, z)$ is the expected distortion. For the unsupervised setting, we define $d_\phi(x, z) = \| x - g_\phi(z) \|_2^2$, i.e. the reconstruction error of $x$ using a decoder $g_\phi$. When $x$ comes with a label $y \in \{1, \ldots, C\}$, we define $d_\phi(x, z) = -\log p_\phi(y \mid z)$, where $p_\phi(y \mid z)$ is the conditional probability given by a softmax classifier $c_\phi$. We parameterize $p(z \mid x)$ as a Gaussian distribution $\mathcal{N}\big(\mu_\theta(x), \mathrm{diag}(\sigma^2_\theta(x))\big)$, with $\mu_\theta$ and $\sigma_\theta$ computed by a neural network with parameters $\theta$. Both $\theta$ and $\phi$ are learned jointly using gradient descent and the reparameterization trick, similarly to the VAE (Kingma & Welling, 2013). Pseudocode of the learning algorithms is given in Sec. 1.6 of the supplementary material.

We conduct experiments on MNIST in both the supervised and unsupervised settings to see the impact of the distortion function and the parameter α. For ease of visualization, the dimensionality of $z$ is set to 2. Details about the architecture and hyperparameters are in Sec. 2 of the supplementary material. Fig. 1 plots the posterior $p_\theta(z \mid x)$ as Gaussian ellipses representing the 95% confidence regions for 2,000 images from the test set. We observe that $z$ must retain adequate information about $x$ to reconstruct it well. Therefore, a smaller value of α is necessary, and the size of the learned posteriors as well as the level of overlap among them is much lower in the unsupervised setting. The impact of α is observed in both settings, with higher values of α leading to larger ellipses, greater overlap, and higher distortion loss. In both cases, harder examples are mapped to smaller posteriors near the center. Finally, when the distortion is based on the class label, the posteriors form clusters well separated by color. In the unsupervised case, however, there is significant overlap among classes 3, 5, and 8, and between 4 and 9. This reflects the fact that these digits, which often share large blocks of similar pixel values, are considered similar when the distortion is based on the mean squared error (MSE) of reconstruction.

Figure 1: Visualization of the posterior p(z | x) as Gaussian ellipses representing the 95% confidence regions for 2,000 images from the test set. Top: unsupervised setting, where the objective is to reconstruct x; bottom: supervised setting, where the objective is to predict the class associated with x.

To further investigate the latent space of PARADISE on a more complex dataset, we train PARADISE on the CelebA dataset using MSE as the distortion measure. Due to limited space, details about the architecture, hyperparameters, and visualization are presented in Sec. 2 and 3 of the supplementary material. After training PARADISE, we fit a multivariate Gaussian distribution to the aggregate posterior of unseen samples. Then, random codes $z$ are sampled to generate face images, from which a series of reconstructions is made. The idea is to explore what each neighborhood of $z$ represents. Fig. 2 shows that these serial reconstructions result in images of similar-looking faces with transformations such as smile, orientation, pose, face shape, eyeglasses, gender, beard, and hair style. Interestingly, we also observe linearization of these semantic transformations.
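A minimal sketch of this vector arithmetic in the latent space is given below; the encoder/decoder handles and the face tensors are illustrative assumptions, and the posterior mean is used as the code.

```python
import torch

def transfer_attribute(encoder, decoder, x_src_neutral, x_src_smile, x_target_neutral):
    """Linear semantic transformation in z-space, as illustrated in Fig. 3 (sketch only)."""
    code = lambda x: encoder(x)[0]                   # posterior mean as the representation
    delta = code(x_src_smile) - code(x_src_neutral)  # direction of the "smile" transformation
    return decoder(code(x_target_neutral) + delta)   # apply it to another person's code
```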
For example, Fig. 3 shows that adding the difference between the z-vector of a smiling face and that of a neutral face to the z-vector of another person's neutral face generates a smiling face of that person.

Figure 2: CelebA image generation. Images in column 1 are generated from random noise. Images in columns 2 to 5 are successive reconstructions of the previous column. In each row, one can observe similar-looking faces with transformations such as smile, face orientation and shape, gender, and hairstyle.

Figure 3: Examples of semantic transformations represented by linear operations in the latent space. In each sub-figure, the difference between the z-vectors corresponding to the images on the right and left in the first row is added to the left images of the other rows to generate the right images. The sub-captions describe the transformation observed, e.g. (b) Gender, Beard.

3.2. Supervised Learning

For the experiments in this and the next subsection, we apply the framework in the supervised setting as described for the MNIST experiment in Sec. 3.1. We conduct experiments on the CIFAR-10 dataset using the pre-activation ResNet (He et al., 2016b) with different numbers of layers, including 20, 32, 56, and 110 (He et al., 2016a). For the ImageNet dataset, we try only the pre-activation ResNet with 34 layers, due to limited computational resources. For all settings, we train models using 10 random seeds and take the average test accuracy, except for ImageNet due to limited resources. For CIFAR-10, PARADISE consistently improves test accuracy by about 0.3% across architectures (see Tab. 1). On ImageNet, however, top-1 and top-5 accuracy drop by about 0.8% and 0.5%, respectively (see Tab. 2). We make the mild conclusion that the rate R(θ) acts as a regularizer that can be useful in certain settings, but it did not help on ImageNet, where the base ResNet-34 model shows little sign of overfitting.

Table 1: Test accuracy (in %) on CIFAR-10. To compute test accuracy, we take the average over models trained with 10 different random seeds.

| Model | RN-20 | RN-32 | RN-56 | RN-110 |
|---|---|---|---|---|
| Standard | 91.20 | 92.16 | 92.81 | 93.22 |
| PARADISE | 91.56 | 92.51 | 92.95 | 93.48 |

Table 2: Test accuracy (in %) on ImageNet.

| Model | Top-1 | Top-5 |
|---|---|---|
| Standard | 72.54 | 90.73 |
| PARADISE | 71.73 | 90.27 |

3.3. Robust Learning

The main focus of this section is improving robustness to adversarial examples, which are carefully perturbed samples that cause AI models to misclassify even though the perturbation is not perceptible to humans. The perturbation can be any kind of transformation, such as changing pixel values, translating, or rotating images. It is challenging to mathematically define the perceptibility of a perturbation to humans. Most research so far has focused on attacks within a neighborhood of the data point, $B(x, \epsilon) = \{x' : \|x' - x\| \le \epsilon\}$. Attacks can be targeted, e.g. causing the model to predict a given target class, or untargeted, e.g. simply causing the model to misclassify. In terms of method, attacks can be white-box, where the attacker has full access to the model architecture and parameters, or black-box, where the attacker does not have such information and often exploits the transferability of white-box attacks (Liu et al., 2016; Tramèr et al., 2017b).

Alemi et al. (2016) claim that Deep Variational Information Bottleneck can improve adversarial robustness. However, DVIB was evaluated only on white-box attacks.
Our investigation (Sec. 3.3, supplementary material) shows that a DVIB model trained on CIFAR-10 achieves significantly lower accuracy on black-box attacks than on white-box FGSM attacks (Goodfellow et al., 2014), a sign of gradient obfuscation (Athalye et al., 2018). The issue is more serious for higher values of α (equivalent to the β parameter in (Alemi et al., 2016)). Under stronger attacks, DVIB's adversarial accuracy degrades to 0.00%, as shown in Tab. 3.

We argue that an inductive bias must be introduced into the learning for the encoder $p_\theta(z \mid x)$ to handle out-of-distribution inputs well. Therefore, we investigate PARADISE with posterior matching (PARADISE-PM). The robust learning objective from Sec. 2 is:

$$\mathcal{L}_{PM}(\theta, \phi) = D(\theta, \phi) + \alpha R(\theta) + \gamma\, \mathrm{PM}(\theta, S) \tag{9}$$

where $S(x)$ is a sampling process that draws points $x'$ similar to $x$ based on our prior knowledge, and $\mathrm{PM}(\theta, S)$ is the posterior matching objective:

$$\mathrm{PM}(\theta, S) = \mathbb{E}_{p(x)}\, \mathbb{E}_{x', x'' \overset{\text{iid}}{\sim} S(x)}\, \mathcal{D}\big(p_\theta(z \mid x'),\, p_\theta(z \mid x'')\big)$$

where $\mathcal{D}(\cdot, \cdot)$ is a discrepancy measure such as the Fréchet distance (Dowson & Landau, 1982). We, however, found a simple discrepancy measure based on the L1 distance to be effective in practice. The sampling process $S(x)$ can be flexibly defined using appropriate transformations such as cropping, rotation, or pixel perturbations. In this work, we consider only pixel perturbations. Details about the discrepancy measure and $S(x)$ are described in Sec. 1 of the supplementary material. The PM objective resembles the idea of logit matching proposed in (Kannan et al., 2018), which has been shown to learn a bumpier, depressed loss landscape that makes models harder to attack but still highly vulnerable (Engstrom et al., 2018). Our visual investigation, presented later, however, shows that PARADISE-PM has a smooth and highly linear loss surface. In addition, we emphasize that although PARADISE is stochastic, we make the model deterministic at inference time by feeding the mean of $p_\theta(z \mid x)$ forward to the classifier, thus eliminating the possibility of gradient obfuscation due to the sampling operation.

For CIFAR-10, we use the base wide ResNet WRN-28-10 architecture (Zagoruyko & Komodakis, 2016). To evaluate adversarial robustness, we craft strong untargeted and multi-targeted attacks (Gowal et al., 2019) following (Qin et al., 2019). Experiment details are presented in Sec. 2 of the supplementary material. We train DVIB and PARADISE with posterior matching and compare with the state-of-the-art baselines collected from (Qin et al., 2019), including ADV (Madry et al., 2017), TRADES (Zhang et al., 2019b), and LLR (Qin et al., 2019). It should be noted that these baselines are trained with 10-step PGD adversarial examples, costing about 11 times the training time of a standard model. Our method only requires doubling the batch size for posterior matching. Tab. 3 compares the adversarial accuracy on CIFAR-10 under the two attacks with a perturbation size ϵ of 8/255. It can be observed that the PM objective significantly improves the robustness of DVIB, from 0.00% to 70.71%. Both DVIB-PM and PARADISE-PM outperform the baselines by a huge margin, reducing adversarial errors by over 41%. Compared to DVIB-PM, PARADISE-PM achieves about 0.5% higher adversarial accuracy. Furthermore, both DVIB-PM and PARADISE-PM reduce errors on natural images by about 50% in comparison to the baselines. Some recent work posits that there is a natural trade-off between standard accuracy and robustness to adversarial examples (Tsipras et al., 2018; Zhang et al., 2019b; Ilyas et al., 2019).
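As a concrete illustration of the posterior matching term with an L1 discrepancy and pixel-perturbation sampling, here is a minimal PyTorch sketch. The uniform-noise form of S(x), the ε radius, and the choice of matching the posterior means and standard deviations are our assumptions; the exact choices are in the paper's supplementary material.

```python
import torch

def posterior_matching_loss(encoder, x, eps=8 / 255):
    """PM(theta, S): L1 discrepancy between the posteriors of two perturbed copies of x.
    encoder(x) -> (mu, log_sigma). S(x) here is uniform pixel noise in an L-inf ball (assumption)."""
    x1 = (x + eps * (2 * torch.rand_like(x) - 1)).clamp(0.0, 1.0)
    x2 = (x + eps * (2 * torch.rand_like(x) - 1)).clamp(0.0, 1.0)
    mu1, log_sigma1 = encoder(x1)
    mu2, log_sigma2 = encoder(x2)
    return (mu1 - mu2).abs().mean() + (log_sigma1.exp() - log_sigma2.exp()).abs().mean()
```

Because the two perturbed copies are encoded in one forward pass, the only extra training cost is the doubled effective batch size mentioned above; at inference time the mean of $p_\theta(z \mid x)$ is fed to the classifier directly.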
Our results confirm this posited trade-off but suggest that robustness can be achieved with a much smaller drop in standard accuracy.

Table 3: Adversarial accuracy (in %) on CIFAR-10 for a perturbation size of 8/255. Baselines include ADV (Madry et al., 2017), TRADES (Zhang et al., 2019b), LLR (Qin et al., 2019), and DVIB (Alemi et al., 2016).

| Model | Natural | Untargeted | Multi-Targeted |
|---|---|---|---|
| ADV | 85.11 | 53.96 | 48.79 |
| TRADES | 87.40 | 50.46 | 49.48 |
| LLR | 86.83 | 52.99 | 51.13 |
| DVIB | 95.72 | 0.00 | 0.00 |
| DVIB-PM | 93.53 | 71.54 | 70.71 |
| PARADISE-PM | 93.77 | 72.04 | 71.17 |

To evaluate our proposed method on large-scale problems, we conduct experiments on the ImageNet dataset. Defense methods have so far been shown to be unsuccessful on ImageNet. PGD training often incurs a significant drop in accuracy, e.g. 15% to 30% depending on the perturbation size used during training, while adding an enormous computational burden and achieving limited success. For example, Xie et al. (2019) train a variant of ResNet-152 against 30-step PGD attacks using 128 GPUs with a batch size of 4,096 and achieve a top-1 accuracy of just 7.5% on untargeted attacks with ϵ = 16/255 (Qin et al., 2019). Even for ϵ = 2/255, Uesato et al. (2018) broke three defense methods, degrading top-1 accuracy on untargeted attacks to below 1%. In this work, we simply train a modest ResNet-34, as allowed by our available resources, to evaluate the potential of the proposed method. We did not manage to train DVIB-PM on ImageNet despite trying various values of α and γ, so we only report results for PARADISE-PM. Tab. 4 shows top-1 accuracy on white-box attacks for different perturbation sizes. PARADISE-PM significantly improves top-1 accuracy for ϵ = 2 and 4, from 0% for the base model to 34.90% and 16.80%, respectively. However, top-1 accuracy degrades to 0.60% for ϵ = 16/255. Although not directly comparable, results for the ResNet-152 with denoising blocks trained with 30-step PGD adversarial examples (Xie et al., 2019) are included in Tab. 4 for reference.

Table 4: Top-1 accuracy (in %) on white-box attacks crafted on 2,000 ImageNet validation images using 200 PGD steps. DENOISE refers to the ResNet-152 with denoising blocks trained with 30-step PGD attacks in Xie et al. (2019).

| Model | Natural | ϵ = 2 | ϵ = 4 | ϵ = 16 |
|---|---|---|---|---|
| ResNet-34 | 72.54 | 0.00 | 0.00 | 0.00 |
| DENOISE | 69.70 | - | 38.90 | 7.50 |
| PARADISE-PM | 61.57 | 34.90 | 16.80 | 0.60 |

Why did PARADISE-PM not repeat its success on ImageNet? Model capacity is a possible reason. Fig. 4 plots the train cross entropy (red) and the PM loss (blue) over epochs. The PM loss on ImageNet did not decrease during training and is more than 10 times higher than that on CIFAR-10. In our experiments, we observe that a low PM loss is highly indicative of adversarial robustness. On CIFAR-10, 70% of the reduction in the PM loss occurs after epoch 96, when the train cross entropy approaches zero and train accuracy reaches almost 100%. The ImageNet model, however, incurs a high cross-entropy loss throughout training. Previous work showed that model capacity plays a crucial role in adversarial robustness (Kurakin et al., 2016; Madry et al., 2017).

Figure 4: Train cross entropy (red) and posterior matching loss (blue) over epochs. The posterior matching loss is multiplied by 1,000 for visibility. (a) CIFAR-10; (b) ImageNet.
We hypothesize that PARADISE-PM would perform much better when applied to a more powerful architecture.

PARADISE-PM is much more robust to black-box attacks. We use attacks from (Goodfellow et al., 2014; Kurakin et al., 2016; Madry et al., 2017; Tramèr et al., 2017a), including FGSM, R+FGSM, Step-Rand, Iter-Rand, and PGD-Rand, to craft adversarial examples and report the minimum accuracy. Tab. 5 shows the accuracy on black-box attacks for different perturbation sizes. For ϵ = 16/255, PARADISE-PM improves the standard model's black-box accuracy from 23.45% to 49.20%, only about a 12% drop from PARADISE-PM's accuracy on natural images.

Table 5: Top-1 black-box accuracy (in %) on 2,000 ImageNet validation images for different perturbation sizes ϵ.

| Model | ϵ = 2 | ϵ = 4 | ϵ = 8 | ϵ = 16 |
|---|---|---|---|---|
| Standard | 51.95 | 41.15 | 31.95 | 23.45 |
| PARADISE-PM | 60.85 | 60.35 | 59.05 | 49.20 |

PARADISE-PM's robustness can be explained by Fig. 5, which visualizes the loss surface and decision boundary of PARADISE-PM and the standard model when moving an input image along the signed-gradient (adversarial) direction and along a random Rademacher vector orthogonal to the signed gradient. The loss surface of PARADISE-PM is much smoother than that of the standard model. Along the random orthogonal direction, the loss is virtually constant and the class prediction does not change. The standard model, however, shows a bumpy loss surface and changes its prediction even along the random direction. More examples presented in Sec. 3 of the supplementary material show the same pattern: PARADISE-PM is virtually constant along random directions, highly linear along the adversarial direction, and has a much simpler decision boundary.

Figure 5: Comparison of the loss surface and decision boundary of PARADISE-PM and the standard model. Top: loss plots generated by moving the input along the signed-gradient (adversarial) direction and along a random Rademacher vector orthogonal to the signed gradient. Bottom: the corresponding decision boundary; green represents the ground-truth class.

Previous work observed that the gradients of different models w.r.t. the same input are orthogonal (Liu et al., 2016). Our investigation (Sec. 3, supplementary material) confirms this observation. Therefore, it is not surprising that PARADISE-PM, being virtually constant along random orthogonal directions, is robust to black-box attacks.

PARADISE-PM's simpler decision boundary also makes targeted attacks on PARADISE-PM harder. Tab. 6 reports the attacker success rate, the percentage of times an attacker successfully causes the model to predict a target class (lower is better). The success rate on the standard model reaches 95.35% for ϵ = 2 and exceeds 99% for ϵ = 4. The success rate on PARADISE-PM is only 3.55% for ϵ = 4. PARADISE-PM is still vulnerable to targeted attacks for ϵ = 16, but Fig. 6 shows that the attack on PARADISE-PM significantly changes the image of the baby to make him look like a leopard, while the targeted attack on the standard model makes little visible change.

Table 6: Attacker success rate (in %, lower is better) on 2,000 ImageNet validation images.

| Model | ϵ = 2 | ϵ = 4 | ϵ = 8 | ϵ = 16 |
|---|---|---|---|---|
| Standard | 95.35 | 99.20 | 99.95 | 100 |
| PARADISE-PM | 0.30 | 3.55 | 41.20 | 90.45 |

Figure 6: Comparison of successful targeted attacks (ϵ = 16/255). Left: the original input; middle: the attack on the standard model; right: the attack on PARADISE-PM. More examples are shown in Sec. 3 of the supplementary material.

Overall, PARADISE-PM significantly improves adversarial robustness to black-box and targeted attacks but is still vulnerable to white-box attacks. Increasing model capacity as well as introducing adversarial examples during training might help optimize the PM loss better.
Moreover, some recent work has focused on improving speed while achieving performance similar to PGD adversarial training (Shafahi et al., 2019; Zhang et al., 2019a; Qin et al., 2019). In particular, Qin et al. (2019) introduce a regularization term that encourages the loss surface to be more linear and smooth around data points so that the inner maximization takes fewer steps to find hard adversarial examples, achieving a 5× speed-up for adversarial training on ImageNet. The visual investigation in this section shows that the loss surface of PARADISE-PM is smooth and highly linear around the data points. Therefore, it can significantly improve the efficiency of adversarial training, and combining the two approaches is a promising direction.

Lastly, Fig. 7 visualizes the loss gradient w.r.t. the input pixels to help investigate which input features strongly affect the model's prediction. We observe that PARADISE-PM's attention is significantly better aligned with humans', while the standard model's gradients look noisy and exhibit no coherent pattern. Similar behavior has been observed in PGD-trained models (Tsipras et al., 2018). More examples are presented in Sec. 3 of the supplementary material.

Figure 7: Visualization of the loss gradient w.r.t. the input pixels. The gradients are clipped to within 3 standard deviations of their mean, rescaled to the range [0, 1], and visualized in RGB mode.

4. Related Work

Our work is closely related to the Information Bottleneck method (Tishby et al., 2000), which measures the goodness of a representation through its mutual information with another variable. We, however, define the goodness of a representation in a more straightforward manner by using a distortion function that directly evaluates the performance on downstream tasks. Furthermore, the algorithm proposed in (Tishby et al., 2000) follows the iterative procedure of the Blahut-Arimoto algorithm, which is infeasible to apply in practice. Our simple yet nontrivial modification of the Blahut-Arimoto algorithm makes it possible to compute the rate-distortion function with gradient-based optimization.

Deep Variational Information Bottleneck (DVIB) (Alemi et al., 2016) is a practical realization of the Information Bottleneck method. Sec. 2 shows that the objective function of DVIB can be recovered within our framework when the search space of the marginal q(z) is fixed. Alemi et al. (2016) claim that DVIB can improve robustness to adversarial attacks, but we show that DVIB is just an instance of gradient obfuscation (Athalye et al., 2018). We further demonstrate that introducing inductive bias into the learning can tremendously improve its robustness.

Wang et al. (2009) propose a rate-distortion approach for semi-supervised learning by applying the information-constraint objective to unlabeled data while computing the distortion loss on labeled data. A straightforward application of the PARADISE framework to semi-supervised learning yields a slightly different formulation, where the distortion is defined as the supervised loss approximated on the available labeled data while the rate-minimization objective is optimized on both labeled and unlabeled data. We can also employ self-supervised objectives as the distortion measure. Semi-supervised learning, however, is not the focus of the experiments in this work and is left for future work.
In the Adversarial Machine Learning literature, many defense methods have been proposed since the discovery of the adversarial example phenomenon (Biggio et al., 2013; Szegedy et al., 2013), but Uesato et al. (2018) and Athalye et al. (2018) showed that most of them do not actually improve adversarial robustness. One of the few reliable defense methods is adversarial training, where a model is trained on adversarial examples crafted during training through a few projected gradient descent (PGD) steps of inner maximization (Madry et al., 2017). However, PGD adversarial training has three major drawbacks: (i) using k inner-maximization steps adds approximately k times the training time or requires a large number of GPUs; (ii) adversarial training often causes a large drop in standard accuracy; (iii) PGD adversarial training still does not scale to ImageNet. PARADISE-PM adds virtually no additional cost to standard training except for doubling the batch size, experiences a much smaller drop in standard accuracy, and has a smooth, highly linear loss surface which may require fewer PGD steps to find good adversarial examples. Therefore, combining PARADISE-PM with adversarial training is a promising direction.

5. Discussion and Conclusion

We have presented an efficient and elegant framework for representation learning based on rate-distortion theory, with extensive experimental evaluation to demonstrate its merits. The resulting model is a new class of stochastic encoders whose key versatility lies in the great freedom in defining the distortion function. A particularly interesting scenario is learning representations using multiple distortion functions, which is directly relevant to multi-task and semi-supervised learning. PARADISE might also extend to multiple data sources by computing the rate-distortion function for a product source (Shannon, 1959). Under the PARADISE framework, one can introduce inductive bias to explicitly influence the partitioning of the input space. The idea of posterior matching is simple yet extremely effective for adversarial robustness. Combining PARADISE-PM with adversarial training is a promising direction to tackle the Adversarial Machine Learning problem at large scale. In this work, we considered only robustness to pixel perturbations, but posterior matching can be applied to other kinds of transformation. Furthermore, robustness is not limited to adversarial attacks. Many AI systems today are quite sensitive. For example, the text recognizer in an OCR pipeline may produce different outputs after updates to the text detector change the cropped inputs it receives. The idea of posterior matching can help deal with such nuisances.

Acknowledgments. This work was partially supported by the US Air Force grant FA2386-19-1-4040.

References

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14-20, 1972.

Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387-402. Springer, 2013.

Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460-473, 1972.

Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610-2620, 2018.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, 2012.

Dowson, D. and Landau, B. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450-455, 1982.

Engstrom, L., Ilyas, A., and Athalye, A. Evaluating and understanding the robustness of adversarial logit pairing. arXiv preprint arXiv:1807.10272, 2018.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Gowal, S., Uesato, J., Qin, C., Huang, P.-S., Mann, T., and Kohli, P. An alternative surrogate loss for PGD-based adversarial testing. arXiv preprint arXiv:1910.09338, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630-645. Springer, 2016b.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.

Kannan, H., Kurakin, A., and Goodfellow, I. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Liu, Y., Chen, X., Liu, C., and Song, D. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., Fawzi, A., De, S., Stanforth, R., and Kohli, P. Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems, pp. 13824-13833, 2019.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
Shafahi, A., Najibi, M., Ghiasi, A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal, 27(3):379-423, 1948.

Shannon, C. E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., 4(142-163):1, 1959.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017a.

Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and McDaniel, P. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017b.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.

Uesato, J., O'Donoghue, B., Oord, A. v. d., and Kohli, P. Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666, 2018.

Wang, Y., Haffari, G., Wang, S., and Mori, G. A rate distortion approach for semi-supervised conditional random fields. In Advances in Neural Information Processing Systems, pp. 2008-2016, 2009.

Xie, C., Wu, Y., Maaten, L. v. d., Yuille, A. L., and He, K. Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501-509, 2019.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Accelerating adversarial training via maximal principle. arXiv preprint arXiv:1905.00877, 2019a.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019b.