# ControlVAE: Controllable Variational Autoencoder

Huajie Shao¹, Shuochao Yao¹, Dachun Sun¹, Aston Zhang², Shengzhong Liu¹, Dongxin Liu¹, Jun Wang³, Tarek Abdelzaher¹

¹Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA. ²AWS Deep Learning, CA, USA. ³Alibaba Group, Seattle, WA. Correspondence to: Tarek Abdelzaher, Huajie Shao.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Variational Autoencoders (VAE) and their variants have been widely used in a variety of applications, such as dialog generation, image generation, and disentangled representation learning. However, existing VAE models may suffer from KL vanishing in language modeling and from low reconstruction quality in disentangling. To address these issues, we propose a novel controllable variational autoencoder framework, ControlVAE, that combines a controller, inspired by automatic control theory, with the basic VAE to improve the performance of the resulting generative models. Specifically, we design a new non-linear PI controller, a variant of proportional-integral-derivative (PID) control, to automatically tune the hyperparameter (weight) added to the VAE objective, using the output KL-divergence as feedback during model training. The framework is evaluated on three applications: language modeling, disentangled representation learning, and image generation. The results show that ControlVAE achieves much better reconstruction quality than competitive methods at comparable disentanglement performance. For language modeling, it not only averts KL vanishing, but also improves the diversity of generated text. Finally, we also demonstrate that ControlVAE improves reconstruction quality for image generation compared to the original VAE.

## 1. Introduction

This paper proposes a novel controllable variational autoencoder, ControlVAE (source code is publicly available at https://github.com/shj1987/ControlVAE-ICML2020.git), that leverages automatic control to precisely control the trade-off between data reconstruction accuracy bounds (from a learned latent representation) and application-specific constraints, such as output diversity or disentangled latent factor representation. Specifically, a controller is designed that stabilizes the value of the KL-divergence (between the learned approximate distribution of the latent variables and their true distribution) in the VAE's objective function to achieve the desired trade-off, thereby improving application-specific performance metrics of several existing VAE models.

The work is motivated by the increasing popularity of VAEs as an unsupervised generative modeling framework that learns an approximate mapping between Gaussian latent variables and data samples when the true latent variables have an intractable posterior distribution (Sohn et al., 2015; Kingma & Welling, 2013). Since VAEs can directly work with both continuous and discrete input data (Kingma & Welling, 2013), they have been widely adopted in various applications, such as image generation (Yan et al., 2016; Liu et al., 2017), dialog generation (Wang et al., 2019; Hu et al., 2017), and disentangled representation learning (Higgins et al., 2017; Kim & Mnih, 2018). Popular VAE applications often involve a trade-off between reconstruction accuracy bounds and some other application-specific goal, effectively manipulated through the KL-divergence.
For example, in (synthetic) text or image generation, a goal is to produce new original text or images, as opposed to reproducing one of the samples in the training data. In text generation, if the KL-divergence is too low, output diversity is reduced (Bowman et al., 2015), which is known as the KL-vanishing problem. To increase output diversity, it becomes advantageous to artificially increase the KL-divergence. The resulting approximation was shown to produce more diverse, yet still authentic-looking outputs. Conversely, disentangled representation learning (Denton et al., 2017) leverages the observation that the KL-divergence in the VAE constitutes an upper bound on information transfer through the latent channels per data sample (Burgess et al., 2018). Artificially decreasing the KL-divergence (e.g., by increasing its weight in the VAE's objective function, as in the β-VAE) therefore imposes a stricter information bottleneck, which was shown to force the learned latent factors to become more independent (i.e., non-redundant), leading to better disentangling. The above examples suggest that a useful extension of VAEs is one that allows users to exercise explicit control over the KL-divergence in the objective function. ControlVAE realizes this extension.

We apply ControlVAE to three different applications: language modeling, disentangling, and image generation. Evaluation results on real-world datasets demonstrate that ControlVAE is able to achieve an adjustable trade-off between reconstruction error and KL-divergence. It can discover more disentangled factors and significantly reduce the reconstruction error compared to the β-VAE (Burgess et al., 2018) for disentangling. For language modeling, it can not only completely avert the KL-vanishing problem, but also improve the diversity of generated data. Finally, we also show that ControlVAE improves reconstruction quality on the image generation task by slightly increasing the value of the KL-divergence compared with the original VAE.

## 2. Preliminaries

The objective function of VAEs consists of two terms: log-likelihood and KL-divergence. The first term tries to reconstruct the input data, while the KL-divergence has the desirable effect of keeping the representation of the input data sufficiently diverse. In particular, the KL-divergence can affect both the reconstruction quality and the diversity of generated data. If the KL-divergence is too high, it affects the accuracy of generated samples. If it is too low, output diversity is reduced, which may be a problem in some applications such as language modeling (Bowman et al., 2015), where it is known as the KL-vanishing problem.

To mitigate KL vanishing, one promising way is to add an extra hyperparameter β (0 ≤ β ≤ 1) to the VAE objective function and control the KL-divergence by increasing β from 0 to 1 with a sigmoid function or a cyclical function (Liu et al., 2019). These methods, however, blindly change β without sampling the actual KL-divergence during model training. Using a similar methodology, researchers recently developed the β-VAE (β > 1) (Higgins et al., 2017; Burgess et al., 2018) to learn disentangled representations by controlling the value of the KL-divergence. However, β-VAE suffers from high reconstruction errors (Kim & Mnih, 2018), because it adds a very large β to the VAE objective, so the model tends to focus disproportionately on optimizing the KL term.
In addition, its hyperparameter is fixed during model training, missing the chance to balance the reconstruction error and the KL-divergence.

Figure 1. Framework of ControlVAE. It combines a controller with the basic VAE framework to stabilize the KL-divergence at a specified value by automatically tuning the weight β(t) in the objective. (The figure shows the encoder q(z|x) and decoder p(x|z) of the VAE objective, together with a controller that compares the output KL-divergence with the desired KL to produce β(t).)

The core technical challenge responsible for the above application problems lies in the difficulty of tuning the weight of the KL-divergence term during model training. Inspired by control systems, we fix this problem using feedback control. Our controllable variational autoencoder is illustrated in Fig. 1. It samples the output KL-divergence at each training step t and feeds it into an algorithm that tunes the hyperparameter, β(t), accordingly, aiming to stabilize the KL-divergence at a desired value, called the set point.

We further design a non-linear PI controller, a variant of the PID control algorithm (Åström et al., 2006), to tune the hyperparameter β(t). PID control is the basic and most prevalent form of feedback control in a large variety of industrial (Åström et al., 2006) and software performance control (Hellerstein et al., 2004) applications. The general model of the PID controller is defined by

$$\beta(t) = K_p e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d \frac{de(t)}{dt}, \tag{1}$$

where β(t) is the output of the controller; e(t) is the error between the actual value and the desired value at time t; and K_p, K_i and K_d denote the coefficients of the P term, I term and D term, respectively.

The basic idea of the PID algorithm is to calculate an error, e(t), between a set point (in this case, the desired KL-divergence) and the current value of the controlled variable (in this case, the actual KL-divergence), and then apply a correction in a direction that reduces that error. The correction is applied to some intermediate, directly accessible variable (in our case, β(t)) that influences the value of the variable we ultimately want to control (the KL-divergence). In general, the correction computed by the controller is the weighted sum of three terms: one changes with the error (P), one changes with the integral of the error (I), and one changes with the derivative of the error (D). In a nonlinear controller, the changes can be described by nonlinear functions. Note that, since derivatives essentially compute the slope of a signal, when the signal is noisy the slope often responds more to variations induced by noise. Hence, following established best practices in control of noisy systems, we do not use the derivative (D) term in our specific controller.
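To make the discrete-time form of Equation (1) concrete, the sketch below assumes one error sample per training step, approximates the integral by a running sum and the derivative by a finite difference, and uses illustrative gain values; it is our own sketch, not code from the paper.

```python
# Minimal sketch of a discrete-time PID update (Eq. 1), assuming one error
# sample per training step. Gains and names are illustrative placeholders.
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # running sum approximating the integral term
        self.prev_error = None   # previous error for the finite-difference D term

    def update(self, error):
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        # Weighted sum of the P, I, and D corrections
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: react to the gap between a desired and an observed KL-divergence.
pid = PID(kp=0.01, ki=1e-4, kd=0.0)  # ControlVAE drops the D term (kd = 0)
correction = pid.update(error=3.0 - 2.1)  # e(t) = desired KL - observed KL
```

ControlVAE departs from this linear form by making the P term nonlinear and bounded, as described in Section 3.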
Next, we introduce VAEs and our objective in more detail.

### 2.1. The Variational Autoencoder (VAE)

Suppose that we have a dataset x of n i.i.d. samples generated by ground-truth latent variables z, interpreted as the representation of the data. Let p_θ(x|z) denote a probabilistic decoder, parameterized by a neural network, that generates data x given the latent variable z. The distribution of the representation corresponding to the dataset x is approximated by the variational posterior, q_φ(z|x), which is produced by an encoder, also parameterized by a neural network. The Variational Autoencoder (VAE) (Rezende et al., 2014; Kingma & Welling, 2013) has been one of the most popular generative models.

The basic idea of the VAE can be summarized as follows: (1) the VAE encodes the input data samples x into a latent variable z, as its distribution of representation, via a probabilistic encoder parameterized by a neural network; (2) it then adopts the decoder to reconstruct the original input data from samples of z. The VAE tries to maximize the marginal likelihood of the reconstructed data, but this involves intractable posterior inference. Thus, researchers adopt backpropagation and stochastic gradient descent (Kingma & Welling, 2013) to optimize its variational lower bound on the log likelihood:

$$\log p_\theta(x) \ge \mathcal{L}_{vae} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\,\|\,p(z)), \tag{2}$$

where p(z) is the prior distribution, e.g., a unit Gaussian; p_θ(x|z) is a probabilistic decoder parameterized by a neural network to generate data x given the latent variable z; and the posterior distribution of the latent variable z given data x is approximated by the variational posterior, q_φ(z|x), which is parameterized by an encoder network. The VAE is trained by maximizing L_vae, which consists of a reconstruction term and a KL term, over the training data. However, the basic VAE models cannot explicitly control the KL-divergence to a specified value. They also often suffer from KL vanishing in language modeling (Bowman et al., 2015; Liu et al., 2019), meaning the KL-divergence becomes zero during optimization.

### 2.2. β-VAE

β-VAE (Higgins et al., 2017; Chen et al., 2018a) is an extension of the basic VAE framework, often used as an unsupervised method for learning a disentangled representation of the data generative factors. A disentangled representation, according to the literature (Bengio et al., 2013), is one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. Compared to the original VAE, β-VAE adds an extra hyperparameter β (β > 1) as a weight of the KL-divergence in the original VAE objective (2). It can be expressed as

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{KL}(q_\phi(z|x)\,\|\,p(z)). \tag{3}$$

In order to discover more disentangled factors, researchers further put a constraint on the total information capacity, C, to control the capacity of the information bottleneck (KL-divergence) (Burgess et al., 2018). A Lagrangian method is then adopted to solve the following optimization problem:

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\,|D_{KL}(q_\phi(z|x)\,\|\,p(z)) - C|, \tag{4}$$

where β is a large hyperparameter (e.g., 100). However, one drawback of β-VAE is that it obtains good disentangling at the cost of reconstruction quality. When the weight β is large, the optimization algorithm tends to optimize the second term in (4), leading to a high reconstruction error.

The above background suggests that a common challenge in applying VAEs (and their extensions) lies in the appropriate weight allocation between the reconstruction accuracy and the KL-divergence in the VAE objective function. As mentioned earlier, we solve this using a nonlinear PI controller that manipulates the value of the non-negative hyperparameter, β(t). This algorithm is described next.

## 3. The ControlVAE Algorithm

During model training, we sample the output KL-divergence, which we denote by $\hat{v}_{kl}(t)$, at training step t. The sampled KL-divergence is then compared to the set point, $v_{kl}$, and the difference, $e(t) = v_{kl} - \hat{v}_{kl}(t)$, is then used as the feedback to a controller to calculate the hyperparameter β(t). ControlVAE can be expressed by the following variational lower bound:

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta(t) D_{KL}(q_\phi(z|x)\,\|\,p(z)). \tag{5}$$
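As a concrete illustration of objective (5), the sketch below shows how the weighted lower bound might be computed for a Gaussian encoder in PyTorch; the tensor names and the Bernoulli reconstruction term are our own assumptions, not code from the authors' release.

```python
import torch
import torch.nn.functional as F

def controlvae_loss(x, x_recon_logits, mu, logvar, beta_t):
    """Negative of the lower bound in Eq. (5) for a Gaussian posterior
    q(z|x) = N(mu, diag(exp(logvar))) and a unit-Gaussian prior p(z).
    beta_t is the controller output at the current training step."""
    batch = x.size(0)
    # Reconstruction term -E_q[log p(x|z)], here with a Bernoulli decoder (our assumption)
    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum") / batch
    # Closed-form KL( q(z|x) || N(0, I) ), averaged over the batch
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return recon + beta_t * kl, kl.detach()  # the detached KL is fed back to the controller
```

The detached KL value plays the role of the sampled $\hat{v}_{kl}(t)$ in the feedback loop described below.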
When the KL-divergence drops below the set point, the controller counteracts this change by reducing the hyperparameter β(t) (to reduce the penalty for KL-divergence in the objective function (5)). The reduced weight, β(t), allows the KL-divergence to grow, thus approaching the set point again. Conversely, when the KL-divergence grows above the set point, the controller increases β(t) (up to a certain value), thereby increasing the penalty for KL-divergence and forcing it to decrease. This effect is achieved by computing β(t) using Equation (6), below, which is an instance of nonlinear PI control:

$$\beta(t) = \frac{K_p}{1 + \exp(e(t))} - K_i \sum_{j=0}^{t} e(j) + \beta_{min}, \tag{6}$$

where K_p and K_i are constants. The first term (on the right-hand side) ranges between 0 and K_p thanks to the exponential function exp(·). Note that when the error is large and positive (KL-divergence is below the set point), the first term approaches 0, leading to a lower β(t) that encourages the KL-divergence to grow. Conversely, when the error is large and negative (KL-divergence is above the set point), the first term approaches its maximum (which is K_p), leading to a higher β(t) that encourages the KL-divergence to shrink.

The second term of the controller sums (integrates) past errors with a sampling period T (one training step in this paper). This creates a progressively stronger correction (until the sign of the error changes). The negative sign ensures that while errors remain positive (i.e., when the KL-divergence is below the set point), this term continues to decrease, whereas while errors remain negative (i.e., when the KL-divergence is above the set point), this term continues to increase. In both cases, the change forces β(t) in a direction that helps the KL-divergence approach the set point. In particular, note that when the error becomes zero, the second term (and thus the entire right-hand side) stops changing, allowing the controller output, β(t), to stay at the same value that hopefully caused the zero error in the first place. This allows the controller to lock in the value of β(t) that meets the KL-divergence set point. Finally, β_min is an application-specific constant. It effectively shifts the range within which β(t) is allowed to vary. This PI controller is illustrated in Fig. 2.

Figure 2. PI controller. It uses the output KL-divergence at training step t as the feedback to the PI algorithm to compute β(t) from the desired KL value via the nonlinear P term K_p / (1 + exp(e(t))) and the integral term.
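To make the sign conventions in (6) concrete, here is a small numerical sketch of our own (the example error values are arbitrary) showing how the nonlinear P term saturates for large positive and negative errors and how the integral term accumulates:

```python
import math

def beta_from_errors(errors, kp=0.01, ki=1e-4, beta_min=0.0):
    """Evaluate Eq. (6) after observing a sequence of errors e(0..t).
    Illustrative only; constants follow the tuning ranges in Section 3.1."""
    integral = sum(errors)                      # sum_{j=0}^{t} e(j)
    p_term = kp / (1.0 + math.exp(errors[-1]))  # saturates within (0, kp)
    return p_term - ki * integral + beta_min

# KL far below the set point: large positive error -> P term near 0, beta stays small.
print(beta_from_errors([5.0]))   # ~0, plus a small negative integral correction
# KL far above the set point: large negative error -> P term near kp, beta grows.
print(beta_from_errors([-5.0]))  # ~0.01, plus a positive integral contribution
```

In Algorithm 1 below, this raw value is additionally clamped to [β_min, β_max] with an anti-windup guard on the integral term.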
### 3.1. PI Parameter Tuning for ControlVAE

One challenge of applying the PI control algorithm lies in how to tune its parameters, K_p and K_i, effectively. While optimal tuning of nonlinear controllers is non-trivial, in this paper we follow a very simple rule: tune these constants to ensure that reactions to errors are sufficiently smooth to allow gradual convergence.

Let us first consider the coefficient K_p. Observe that the maximum (positive) error occurs when the actual KL-divergence is close to zero. In this case, if v_kl is the set point on the KL-divergence, then the error, e(t), is approximated by e(t) ≈ v_kl − 0 = v_kl. When the KL-divergence is too small, the VAE does not learn useful information from the input data (Liu et al., 2019). We need to assign β(t) a very small non-negative value, so that the KL-divergence is encouraged to grow (when the resulting objective function is optimized). In other words, temporarily ignoring the other terms in Equation (6), the contribution of the first term alone should be sufficiently small:

$$\frac{K_p}{1 + \exp(v_{kl})} \le \epsilon, \tag{7}$$

where ε is a small constant (e.g., 10⁻³ in our implementation). Inequality (7) can also be rewritten as K_p ≤ (1 + exp(v_kl)) ε. Empirically, we find that K_p = 0.01 leads to good performance and satisfies the above constraint.

Conversely, when the actual KL-divergence is much larger than the desired value v_kl, the error e(t) becomes a large negative value. As a result, the first term in (6) becomes close to a constant, K_p. If the resulting larger value of β(t) is not enough to cause the KL-divergence to shrink, one needs to gradually continue to increase β(t). This is the job of the second term. The negative sign in front of that term ensures that, as negative errors continue to accumulate, the positive output β(t) continues to increase. Since it takes many steps to train deep VAE models, the increase per step should be very small, favoring smaller values of K_i. Empirically, we found that a value of K_i between 10⁻⁴ and 10⁻³ stabilizes the training. Note that K_i should not be too small either, because it would then unnecessarily slow down convergence.

### 3.2. Set Point Guidelines for ControlVAE

The choice of the desired value of KL-divergence (set point) is largely application specific. In general, when β_min ≤ β(t) ≤ β_max, the upper bound of the expected KL-divergence, denoted V_max, is the value of KL-divergence to which ControlVAE converges when β(t) = β_min. Similarly, its lower bound, V_min, can be defined as the KL-divergence produced by ControlVAE when β(t) = β_max. For feedback control to be most effective (i.e., not run against the above limits), the KL-divergence set point should vary in the range [V_min, V_max]. Since ControlVAE is an end-to-end learning model, users can customize the desired value of KL-divergence (using the KL-divergence of the original VAE as a reference) to meet their demands in different applications. For instance, if users prefer to improve the diversity of text generation and image generation, they can slightly increase the KL-divergence produced by the original VAE. Otherwise, they can reduce the KL-divergence of the original VAE if they want to improve generation accuracy.

### 3.3. Summary of the PI Control Algorithm

We summarize the proposed PI control algorithm in Algorithm 1. Our PI algorithm updates the hyperparameter, β(t), with the feedback from the sampled KL-divergence at training step t.

Algorithm 1. PI algorithm.
1: Input: desired KL v_kl, coefficients K_p, K_i, max/min values β_max, β_min, iterations N
2: Output: hyperparameter β(t) at training step t
3: Initialization: I(0) = 0, β(0) = 0
4: for t = 1 to N do
5:   Sample the KL-divergence, v̂_kl(t)
6:   e(t) ← v_kl − v̂_kl(t)
7:   P(t) ← K_p / (1 + exp(e(t)))
8:   if β_min ≤ β(t−1) ≤ β_max then
9:     I(t) ← I(t−1) − K_i e(t)
10:  else
11:    I(t) ← I(t−1)   // Anti-windup
12:  end if
13:  β(t) ← P(t) + I(t) + β_min
14:  if β(t) > β_max then
15:    β(t) ← β_max
16:  end if
17:  if β(t) < β_min then
18:    β(t) ← β_min
19:  end if
20:  Return β(t)
21: end for

Line 6 computes the error between the desired KL-divergence, v_kl, and the sampled v̂_kl(t). Lines 7 to 9 calculate the P term and the I term of the PI algorithm, respectively. Note that Lines 10 and 11 implement a popular constraint in PID/PI design, called anti-windup (Azar & Serrano, 2015; Peng et al., 1996). It effectively disables the integral term of the controller when the controller output gets out of range, so as not to exacerbate the out-of-range deviation. Line 13 computes the hyperparameter β(t) of the PI algorithm in (6). Finally, Lines 14 to 19 limit β(t) to the range [β_min, β_max].
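For readers who want to drop the controller into a training loop, the sketch below is our own Python rendering of Algorithm 1, not the authors' released code; it keeps the anti-windup guard and the output clamping, with one small implementation choice noted in the comments.

```python
import math

class PIController:
    """Nonlinear PI controller of Algorithm 1: maps the sampled KL-divergence
    at each training step to the weight beta(t) in the ControlVAE objective."""

    def __init__(self, desired_kl, kp=0.01, ki=1e-4, beta_min=0.0, beta_max=1.0):
        self.desired_kl = desired_kl
        self.kp, self.ki = kp, ki
        self.beta_min, self.beta_max = beta_min, beta_max
        self.integral = 0.0        # I(t)
        self.last_raw = beta_min   # previous output used for the anti-windup check

    def step(self, sampled_kl):
        error = self.desired_kl - sampled_kl            # e(t) = v_kl - v_kl_hat(t)
        p_term = self.kp / (1.0 + math.exp(error))      # nonlinear P term, bounded in (0, kp)
        # Anti-windup: only accumulate the integral while the output is not saturated.
        # (Algorithm 1 checks beta(t-1); we track the pre-clamp value for this check.)
        if self.beta_min <= self.last_raw <= self.beta_max:
            self.integral -= self.ki * error
        self.last_raw = p_term + self.integral + self.beta_min
        return min(max(self.last_raw, self.beta_min), self.beta_max)

# Hypothetical use inside a training loop:
#   controller = PIController(desired_kl=3.0)        # language-modeling setting
#   beta_t = controller.step(sampled_kl=kl.item())   # feed back the measured KL
#   loss = recon + beta_t * kl                       # Eq. (5), minimized
```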
### 3.4. Applications of ControlVAE

As a preliminary demonstration of the general applicability of the above approach, and as an illustration of its customizability, we apply ControlVAE to the three different applications stated below. (The corresponding controller settings are collected in the sketch after this section.)

Language modeling: We first apply ControlVAE to solve the KL-vanishing problem while improving the diversity of generated data. As mentioned in Section 2.1, VAE models often suffer from KL vanishing in language modeling. The existing methods cannot completely solve the KL-vanishing problem or explicitly manipulate the value of the KL-divergence. In this paper, we adopt ControlVAE to drive the KL-divergence to a specified value, using the output KL-divergence as feedback, and thereby avoid KL vanishing. Following the PI tuning strategy in Section 3.1, we set K_p and K_i of the PI algorithm in (6) to 0.01 and 0.0001, respectively. In addition, β_min is set to 0 and the maximum value of β(t) is limited to 1.

Disentangling: We then apply ControlVAE to achieve a better trade-off between reconstruction quality and disentangling. As mentioned in Section 2.2, β-VAE (β > 1) assigns a large hyperparameter to the objective function to control the KL-divergence (information bottleneck), which, however, leads to a large reconstruction error. To mitigate this issue, we adopt ControlVAE to automatically adjust the hyperparameter β(t) based on the output KL-divergence during model training. Using a methodology similar to (Burgess et al., 2018), we train a single model by gradually increasing the KL-divergence set point from 0.5 to a desired value C with a step size α every K training steps. Since β(t) > 1, we set β_min to 1 for the PI algorithm in (6). Following the PI tuning method above, the coefficients K_p and K_i are set to 0.01 and 0.001, respectively.

Image generation: In this paper, we leverage ControlVAE to manipulate (slightly increase) the value of the KL-divergence in order to improve the reconstruction quality for image generation. Different from the original VAE (β(t) = 1), we extend the range of the hyperparameter, β(t), to [0, 1] in our ControlVAE model. Given a desired KL-divergence, ControlVAE can automatically tune β(t) within that range. For this task, we use the same PI control algorithm and hyperparameters as for language modeling above.
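The per-application settings quoted above can be collected into a small configuration summary; the dictionary below is our own, reusing the PIController sketch from Section 3.3, with hypothetical names and with example KL set points taken from Section 4 (β_max for disentangling is not stated in the paper and is an assumed cap).

```python
# Our own summary of the controller settings described in Section 3.4.
# Names are hypothetical; example KL set points follow Section 4.
CONTROLLER_SETTINGS = {
    "language_modeling": dict(kp=0.01, ki=1e-4, beta_min=0.0, beta_max=1.0,
                              desired_kl=3.0),    # e.g., ControlVAE-KL-3 on PTB
    "disentangling":     dict(kp=0.01, ki=1e-3, beta_min=1.0, beta_max=100.0,
                              desired_kl=18.0),   # set point stepped from 0.5 up to C = 18;
                                                  # beta_max = 100 is an assumed cap
    "image_generation":  dict(kp=0.01, ki=1e-4, beta_min=0.0, beta_max=1.0,
                              desired_kl=200.0),  # e.g., ControlVAE-KL-200 on CelebA
}

def make_controller(task):
    cfg = CONTROLLER_SETTINGS[task]
    return PIController(cfg["desired_kl"], cfg["kp"], cfg["ki"],
                        cfg["beta_min"], cfg["beta_max"])
```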
## 4. Experiments

We evaluate the performance of ControlVAE on benchmark datasets in the three different applications mentioned above.

### 4.1. Datasets

The datasets used in our experiments are introduced below.

Language modeling: 1) Penn Tree Bank (PTB) (Marcus et al., 1993): it consists of 42,068 training sentences, 3,370 validation sentences, and 3,761 testing sentences. 2) Switchboard (SW) (Godfrey & Holliman, 1997): it has 2,400 two-sided telephone conversations with manually transcribed speech and alignment. The data is randomly split into 2,316, 60, and 62 dialogs for training, validation, and testing.

Disentangling: 1) 2D Shapes (Matthey et al., 2017): it has 737,280 binary 64 × 64 images of 2D shapes with five ground-truth factors (number of values): shape (3), scale (6), orientation (40), x-position (32), y-position (32) (Kim & Mnih, 2018).

Image generation: 1) CelebA (cropped version) (Liu et al., 2015): it has 202,599 RGB 128 × 128 × 3 images of celebrity faces. The data is split into 192,599 and 10,000 images for training and testing.

Figure 3. Performance comparison for different methods on the PTB data: (a) KL-divergence, (b) reconstruction loss, and (c) hyperparameter β(t) over training steps. Panel (a) shows that ControlVAE and Cyclical annealing (4, 8 cycles) can avert KL vanishing, while Cost annealing still suffers from KL vanishing after 20K and 50K training steps. Moreover, ControlVAE can control the KL-divergence and also has lower reconstruction errors than the other methods in (b).

Table 1. Performance comparison for different methods on dialog generation using the SW data over 5 random seeds. Dis-n: higher is better. PPL: lower is better. Self-BLEU: lower is better.

| Methods / metric | Dis-1 | Dis-2 | Self-BLEU-2 | Self-BLEU-3 | PPL |
|---|---|---|---|---|---|
| ControlVAE-KL-35 | 6.27K ± 41 | 95.86K ± 1.02K | 0.663 ± 0.012 | 0.447 ± 0.013 | 8.81 ± 0.05 |
| ControlVAE-KL-25 | 6.10K ± 60 | 83.15K ± 4.00K | 0.698 ± 0.006 | 0.495 ± 0.014 | 12.47 ± 0.07 |
| Cost anneal-KL-17 | 5.71K ± 87 | 69.60K ± 1.53K | 0.721 ± 0.010 | 0.536 ± 0.008 | 16.82 ± 0.11 |
| Cyclical (KL = 21.5) | 5.79K ± 81 | 71.63K ± 2.04K | 0.710 ± 0.007 | 0.524 ± 0.008 | 17.81 ± 0.33 |

### 4.2. Model Configurations

The detailed model configurations and hyperparameter settings for each model are presented in Appendix A.

### 4.3. Evaluation on Language Modeling

First, we compare the performance of ControlVAE with the following baselines for mitigating KL vanishing in text generation (Bowman et al., 2015). (A sketch of both schedules is given after this discussion.)

Cost annealing (Bowman et al., 2015): this method gradually increases the hyperparameter on the KL-divergence from 0 to 1 over the first N training steps using a sigmoid function.

Cyclical annealing (Liu et al., 2019): this method splits the training process into M cycles, and each cycle increases the hyperparameter from 0 to 1 using a linear function.

Fig. 3 compares the KL-divergence, reconstruction loss, and hyperparameter β(t) of the different methods on the PTB dataset. Note that ControlVAE-KL-v means we set the desired KL-divergence to a value v (e.g., 3) for our PI controller, following the set point guidelines in Section 3.2. Cost-annealing-v means we gradually increase the hyperparameter, β(t), from 0 to 1 over the first v steps using the sigmoid function. We observe from Fig. 3(a) that ControlVAE (KL = 1.5, 3) and Cyclical annealing (4, 8 cycles) can avert KL vanishing. However, our ControlVAE is able to stabilize the KL-divergence while cyclical annealing cannot. Moreover, our method has a lower reconstruction loss than cyclical annealing in Fig. 3(b). The cost annealing method still suffers from KL vanishing, because we use the Transformer (Vaswani et al., 2017) as the decoder, which can predict the current data based on previous ground-truth data. Fig. 3(c) illustrates the tuning of β(t) by ControlVAE compared with the other methods. We can see that our β(t) gradually converges to around a certain value. Note that β(t) of ControlVAE does not converge to 1 here, because we slightly increase the value of the KL-divergence (relative to that produced by the original VAE) in order to improve the diversity of generated data.
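For reference, the two annealing baselines prescribe β as a fixed function of the training step rather than reacting to the measured KL-divergence. The sketch below is our own rendering of those schedules; the sigmoid slope and the within-cycle ratio are assumed values, since the baselines leave them as tunable knobs.

```python
import math

def cost_annealing_beta(step, warmup_steps, slope=0.005):
    """Sigmoid schedule: beta rises from ~0 to ~1 around `warmup_steps`.
    The slope is an assumed value; Bowman et al. (2015) leave it as a knob."""
    return 1.0 / (1.0 + math.exp(-slope * (step - warmup_steps)))

def cyclical_annealing_beta(step, total_steps, n_cycles=4, ratio=0.5):
    """Cyclical schedule (Liu et al., 2019): within each cycle, beta increases
    linearly from 0 to 1 during the first `ratio` fraction, then stays at 1."""
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len
    return min(pos / ratio, 1.0)

# Neither schedule observes the actual KL-divergence, which is why they can
# neither stabilize it at a target value nor guarantee that KL vanishing is avoided.
```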
To further demonstrate that ControlVAE can improve the diversity of generated text, we apply it to dialog-response generation using the Switchboard (SW) dataset. Following (Zhao et al., 2017), we adopt a conditional VAE (Zhao et al., 2017) that generates a dialog response conditioned on the previous response. We use the metrics Dis-n (Xu et al., 2018) and self-BLEU (Zhu et al., 2018) (with 1,000 sampled results) to measure the diversity of the generated data, and perplexity (PPL) (Jelinek et al., 1977) to measure how well the probability distribution predicts a sample. Table 1 shows the comparison results for the different approaches. We can observe that ControlVAE has more distinct n-grams and lower self-BLEU than the baselines when the desired KL-divergence is set to 35 and 25. In addition, it has lower PPL than the other methods. Thus, we conclude that ControlVAE improves both the diversity of the generated data and the generation performance. We also show some examples of dialog generated by ControlVAE in Appendix B.
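For completeness, the self-BLEU diversity metric used above scores each generated sample against all the other samples as references and averages the result (lower means more diverse). The sketch below is our own, using NLTK's sentence-level BLEU; the Texygen implementation of Zhu et al. (2018) may differ in details such as smoothing.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples, n=2):
    """samples: list of tokenized sentences (lists of tokens).
    Returns the average BLEU-n of each sample against all the others."""
    weights = tuple(1.0 / n for _ in range(n))
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(samples):
        refs = samples[:i] + samples[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```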
### 4.4. Evaluation on Disentangled Representations

We then evaluate the performance of ControlVAE on the learning of disentangled representations using the 2D Shapes data. We compare it with two baselines: FactorVAE (Kim & Mnih, 2018) and β-VAE (Burgess et al., 2018). Fig. 4(a) and (b) compare the reconstruction error and the hyperparameter β(t) (using 5 random seeds) for the different models. We can observe from Fig. 4(a) that ControlVAE (KL = 16, 18) has lower reconstruction error and variance than the baselines. This is because ControlVAE automatically adjusts the hyperparameter, β(t), to stabilize the KL-divergence, while the other two methods keep the hyperparameter unchanged during model training. Specifically, for ControlVAE (KL = 18), the hyperparameter β(t) is high in the beginning in order to obtain good disentangling, and then it gradually drops to around 1.8 as the training converges, as shown in Fig. 4(b). In contrast, β-VAE (β = 100) has a large, fixed weight on the KL-divergence, so its optimization algorithm tends to optimize the KL-divergence term, leading to a large reconstruction error. In addition, Fig. 4(c) illustrates an example of the KL-divergence per factor in the latent code as training progresses and the total information capacity (KL-divergence) increases from 0.5 to 18. We can see that ControlVAE disentangles all five generative factors, starting from the positional latents (x and y), then scale, followed by orientation and then shape.

Figure 4. (a), (b): comparison of reconstruction error and β(t) using 2D Shapes data over 5 random seeds. ControlVAE (KL = 16, 18) has lower reconstruction errors and variance compared to β-VAE. (c): an example of the disentangled factors in the latent variable as the total KL-divergence increases from 0.5 to 18 for ControlVAE (KL = 18). Each curve with positive KL-divergence (except the black one, which is the total KL) represents one factor disentangled by ControlVAE.

To further demonstrate that ControlVAE achieves better disentangling, we use a disentanglement metric, the mutual information gap (MIG) (Chen et al., 2018a), to compare performance, as shown in Table 2. It can be observed that ControlVAE (KL = 16) has a comparable MIG but lower variance than FactorVAE. It is worth noting that FactorVAE adds a Total Correlation (TC) term to the objective while our method does not.

Table 2. Performance comparison of different methods using the disentanglement metric MIG, averaged over 5 random seeds (higher is better). ControlVAE (KL = 16) has a comparable MIG score but lower variance than FactorVAE with its default parameters.

| Metric | ControlVAE (KL = 16) | ControlVAE (KL = 18) | β-VAE (β = 100) | FactorVAE (γ = 10) |
|---|---|---|---|---|
| MIG | 0.5628 ± 0.0222 | 0.5432 ± 0.0281 | 0.5138 ± 0.0371 | 0.5625 ± 0.0443 |

Since there does not exist a robust metric that fully measures disentanglement, we also show the qualitative results of the different models in Fig. 5. We can observe that ControlVAE discovers all five generative factors: the positional latents (x and y), scale, orientation, and shape. However, β-VAE (β = 100) disentangles four generative factors but entangles scale and shape together (in the third row), while FactorVAE (γ = 10) does not disentangle the orientation factor very well (in the fourth row of Fig. 5). Thus, ControlVAE achieves better reconstruction quality and disentangling than the baselines.

Figure 5. Rows: latent traversals ordered by the value of the KL-divergence with the prior, in descending order, for ControlVAE (KL = 16), β-VAE (β = 100), and FactorVAE (γ = 10). Following (Burgess et al., 2018), we initialize the latent representation from a seed image and then traverse a single latent dimension in the range [−3, 3], while keeping the remaining latent dimensions fixed. ControlVAE can disentangle all five generative factors of the 2D Shapes data, while β-VAE entangles scale and shape (third row) and FactorVAE does not disentangle orientation (fourth row) very well.
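For reference, the MIG score reported in Table 2 can be computed from a matrix of mutual-information estimates between each latent dimension and each ground-truth factor. The sketch below is our own, following the definition in Chen et al. (2018a); it assumes that the mutual-information matrix and the factor entropies have already been estimated, which is the hard part in practice.

```python
import numpy as np

def mig_score(mi, factor_entropy):
    """Mutual Information Gap (Chen et al., 2018a).

    mi:             array of shape (num_latents, num_factors), entries I(z_j; v_k)
    factor_entropy: array of shape (num_factors,), entries H(v_k)

    For each factor, take the gap between the two latents with the highest MI,
    normalize by the factor's entropy, and average over factors.
    """
    sorted_mi = np.sort(mi, axis=0)            # ascending along the latent axis
    gap = sorted_mi[-1, :] - sorted_mi[-2, :]  # top-1 minus top-2 per factor
    return float(np.mean(gap / factor_entropy))

# Toy example with 3 latents and 2 factors (numbers are made up):
mi = np.array([[0.9, 0.1],
               [0.2, 0.7],
               [0.1, 0.1]])
print(mig_score(mi, factor_entropy=np.array([1.0, 1.0])))  # (0.7 + 0.6) / 2 = 0.65
```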
### 4.5. Evaluation on Image Generation

Finally, we compare the reconstruction quality of ControlVAE and the original VAE on the image generation task. Fig. 6 shows the comparison of reconstruction error and KL-divergence under different desired values of the KL-divergence, averaged over 3 random seeds. We can see from Fig. 6(a) that ControlVAE-KL-200 (KL = 200) has the lowest reconstruction error among them. This is because there exists a trade-off between the reconstruction accuracy and the KL-divergence in the VAE objective: when the value of the KL-divergence rises, which means the weight of the KL term decreases, the model tends to optimize the reconstruction term. In addition, when we set the desired KL-divergence to 170 (the same as that of the basic VAE in Fig. 6(b)), ControlVAE has the same reconstruction error as the original VAE. At that point, ControlVAE becomes the original VAE, as β(t) finally converges to 1, as shown in Fig. 7 in Appendix C.

Figure 6. Performance comparison for different methods on the CelebA data: (a) reconstruction loss and (b) KL-divergence over training steps for ControlVAE-KL-170/180/200 and the original VAE.

We further adopt two commonly used metrics for image generation, FID (Lucic et al., 2018) and SSIM (Chen et al., 2018b), to evaluate the performance of ControlVAE in Table 3. It can be observed that ControlVAE-KL-200 outperforms the other methods in terms of FID and SSIM. Therefore, ControlVAE can improve the reconstruction quality for the image generation task by slightly increasing the value of the KL-divergence. We also show some examples verifying that ControlVAE has better reconstruction quality than the basic VAE in Appendix D.

Table 3. Performance comparison for different methods on CelebA data over 3 random seeds. FID: lower is better. SSIM: higher is better.

| Methods / metric | FID | SSIM |
|---|---|---|
| ControlVAE-KL-200 | 55.16 ± 0.187 | 0.687 ± 0.0002 |
| ControlVAE-KL-180 | 57.57 ± 0.236 | 0.679 ± 0.0003 |
| ControlVAE-KL-170 | 58.75 ± 0.286 | 0.675 ± 0.0001 |
| Original VAE | 58.71 ± 0.207 | 0.675 ± 0.0001 |

## 5. Related Work

We review related work on VAEs and their variants in various applications, and then point out the differences between our method and the prior work.

Many works involve a trade-off between reconstruction and KL-divergence in VAE applications. For disentangled representation learning, researchers proposed β-VAE (β > 1) (Higgins et al., 2017; Burgess et al., 2018), which assigns a large, fixed hyperparameter, β, to put more emphasis on the KL-divergence and thereby encourage disentangled latent representations. It, however, sacrifices reconstruction quality in order to obtain better disentangling. Some follow-up work (Chen et al., 2018a; Kim & Mnih, 2018) further factorizes the KL-divergence term to improve reconstruction quality. However, these methods still assign a fixed, large hyperparameter to the decomposed terms in the objective, resulting in high reconstruction error. In contrast, ControlVAE dynamically tunes β during optimization to achieve good disentangling and better reconstruction quality. More importantly, our PI control method can be used as a plug-in replacement in existing methods.

In order to improve the sample generation quality of VAEs (Dai & Wipf, 2019; Zhao et al., 2019; Xiao et al., 2019; Ghosh et al., 2019; Alemi et al., 2017), some researchers tried to reduce the weight of the KL-divergence to make the decoder produce sharper outputs. Though they can obtain impressive sample quality, they suffer severely from the trade-off in the sense that the latent distribution is far away from the prior. Recent studies adopted constrained optimization of the reconstruction error (Rezende & Viola, 2018; Klushyn et al., 2019) to achieve the trade-off between reconstruction error and KL-divergence. However, they may suffer from posterior collapse if the inference network fails to cover the latent space, while ours can totally avert posterior collapse. In addition, different from their work, we optimize the KL-divergence (information bottleneck) as a constraint. Our method and theirs complement each other for different applications.

In language modeling, VAEs often suffer from KL vanishing due to a powerful decoder, such as a Transformer (Vaswani et al., 2017) or an LSTM. To remedy this issue, one popular way is to add a hyperparameter β on the KL term (Bowman et al., 2015; Liu et al., 2019) and then gradually increase it from 0 to 1. However, the existing methods (Yang et al., 2017; Bowman et al., 2015; Liu et al., 2019), such as KL cost annealing and cyclical annealing, cannot totally solve KL vanishing or explicitly control the value of the KL-divergence, since they blindly change β without observing the actual KL-divergence during model training.
Conversely, our approach can avert KL vanishing and stabilize the KL-divergence at a desired value.

## 6. Conclusion and Future Work

In this paper, we proposed a general controllable VAE framework, ControlVAE, that combines automatic control with the basic VAE framework to improve the performance of VAE models. We designed a new non-linear PI controller to control the value of the KL-divergence during model training. We then evaluated ControlVAE on three different tasks. The results show that ControlVAE attains better performance: it achieves higher reconstruction quality together with good disentanglement, it totally averts KL vanishing and improves the diversity of generated data in language modeling, and it improves reconstruction quality on the image generation task. For future work, we plan to apply our method to other research topics, such as topic modeling and semi-supervised applications.

## Acknowledgments

Research reported in this paper was sponsored in part by DARPA award W911NF-17-C-0099, DTRA award HDTRA1-18-1-0026, and the Army Research Laboratory under Cooperative Agreements W911NF-09-2-0053 and W911NF-17-2-0196.

## References

Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. arXiv preprint arXiv:1711.00464, 2017.

Åström, K. J. and Hägglund, T. Advanced PID Control, volume 461. ISA - The Instrumentation, Systems, and Automation Society, 2006.

Azar, A. T. and Serrano, F. E. Design and modeling of anti-windup PID controllers. In Complex System Modelling and Control through Intelligent Soft Computations, pp. 1-44. Springer, 2015.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.

Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610-2620, 2018a.

Chen, X., Xu, C., Yang, X., and Tao, D. Attention-GAN for object transfiguration in wild images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 164-180, 2018b.

Dai, B. and Wipf, D. Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789, 2019.

Denton, E. L. et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pp. 4414-4423, 2017.

Ghosh, P., Sajjadi, M. S., Vergari, A., Black, M., and Schölkopf, B. From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436, 2019.

Godfrey, J. and Holliman, E. Switchboard-1 Release 2. Linguistic Data Consortium, SWITCHBOARD: A User's Manual, 1997.

Hellerstein, J. L., Diao, Y., Parekh, S., and Tilbury, D. M. Feedback Control of Computing Systems. John Wiley & Sons, 2004.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2(5):6, 2017.

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., and Xing, E. P. Toward controlled generation of text.
In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1587-1596. JMLR.org, 2017.

Hu, Z., Shi, H., Tan, B., Wang, W., Yang, Z., Zhao, T., He, J., Qin, L., Wang, D., et al. Texar: A modularized, versatile, and extensible toolkit for text generation. In ACL 2019, System Demonstrations, 2019.

Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. Perplexity: a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1):S63-S63, 1977.

Kim, H. and Mnih, A. Disentangling by factorising. In International Conference on Machine Learning, pp. 2654-2663, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Klushyn, A., Chen, N., Kurle, R., Cseke, B., and van der Smagt, P. Learning hierarchical priors in VAEs. In Advances in Neural Information Processing Systems, pp. 2866-2875, 2019.

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207-1216, Stanford, CA, 2000. Morgan Kaufmann.

Liu, M.-Y., Breuel, T., and Kautz, J. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pp. 700-708, 2017.

Liu, X., Gao, J., Celikyilmaz, A., Carin, L., et al. Cyclical annealing schedule: A simple approach to mitigating KL vanishing. arXiv preprint arXiv:1903.10145, 2019.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738, 2015.

Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 700-709, 2018.

Marcus, M., Santorini, B., and Marcinkiewicz, M. A. Building a large annotated corpus of English: The Penn Treebank. 1993.

Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. URL https://github.com/deepmind/dsprites-dataset/. [Accessed on: 2018-05-08], 2017.

Peng, Y., Vrancic, D., and Hanus, R. Anti-windup, bumpless, and conditioned transfer techniques for PID controllers. IEEE Control Systems Magazine, 16(4):48-57, 1996.

Rezende, D. J. and Viola, F. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483-3491, 2015.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Wang, W., Gan, Z., Xu, H., Zhang, R., Wang, G., Shen, D., Chen, C., and Carin, L. Topic-guided variational autoencoders for text generation. arXiv preprint arXiv:1903.07137, 2019.

Xiao, Z., Yan, Q., Chen, Y., and Amit, Y. Generative latent flow: A framework for non-adversarial image generation. arXiv preprint arXiv:1905.10485, 2019.

Xu, J., Ren, X., Lin, J., and Sun, X. DP-GAN: Diversity-promoting generative adversarial network for generating informative and diversified text. arXiv preprint arXiv:1802.01345, 2018.
Yan, X., Yang, J., Sohn, K., and Lee, H. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776-791. Springer, 2016.

Yang, Z., Hu, Z., Salakhutdinov, R., and Berg-Kirkpatrick, T. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3881-3890. JMLR.org, 2017.

Zhao, S., Song, J., and Ermon, S. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5885-5892, 2019.

Zhao, T., Zhao, R., and Eskenazi, M. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960, 2017.

Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1097-1100, 2018.