# recursive_disentanglement_network__8fac416e.pdf

Published as a conference paper at ICLR 2022

RECURSIVE DISENTANGLEMENT NETWORK

Yixuan Chen, Yubin Shi, Dongsheng Li
{yixuanchen20, ybshi21}@fudan.edu.cn, dongsli@microsoft.com
Yujiang Wang, Mingzhi Dong
yujiang.wang14@imperial.ac.uk, mingzhidong@gmail.com
Yingying Zhao, Robert Dick
yingyingzhao@fudan.edu.cn, dickrp@umich.edu
Qin Lv, Fan Yang, Li Shang
qin.lv@colorado.edu, {yangfan, lishang}@fudan.edu.cn

ABSTRACT

Disentangled feature representation is essential for data-efficient learning. The feature space of deep models is inherently compositional. Existing β-VAE-based methods, which apply disentanglement regularization only to the resulting embedding space of deep models, cannot effectively regularize such a compositional feature space, which leads to unsatisfactory disentanglement. In this paper, we formulate the compositional disentanglement learning problem from an information-theoretic perspective and propose a Recursive Disentanglement Network (RecurD) that propagates regulatory inductive bias recursively across the compositional feature space during disentangled representation learning. Experimental studies demonstrate that RecurD outperforms β-VAE and several of its state-of-the-art variants on disentangled representation learning and enables more data-efficient downstream machine learning tasks.

1 INTRODUCTION

Recent progress in machine learning demonstrates that the ability to learn disentangled representations is essential for data-efficient learning in tasks such as controllable image generation, image manipulation, and domain adaptation (Suter et al., 2019; Zhu et al., 2018; Peng et al., 2019; Gabbay & Hoshen, 2021; 2019). β-VAE (Higgins et al., 2017) and its variants are the most investigated approaches for disentangled representation learning. Recent β-VAE-based methods introduce various inductive biases as regularization terms and apply them directly to the resulting embedding space of deep models, such as the bottleneck capacity constraint (Higgins et al., 2017; Burgess et al., 2018), the total correlation among variables (Kim & Mnih, 2018; Chen et al., 2018), and the mismatch between the aggregated posterior and the prior (Kumar et al., 2017), aiming to balance representation capacity, independence constraints, and reconstruction accuracy. Indeed, as demonstrated by Locatello et al. (2020; 2019), unsupervised disentanglement is fundamentally impossible without explicit inductive biases on models and data sets. However, our study shows that existing β-VAE-based methods may fail to learn satisfactory disentangled representations even in fairly simple cases. The reason is that the feature spaces of deep models have inherently compositional structures, i.e., each complex feature is a composition of primitive features, whereas existing methods, whose regularization terms act solely on the resulting embedding space, cannot effectively propagate disentanglement regularization across such a compositional feature space.
As shown in Figure 1, we apply the standard β-VAE to the widely used dSprites dataset (Matthey et al., 2017), visualize the resulting representation z as well as its compositional low-level representations m extracted from the previous layer (as shown in Figure 1(a)), and evaluate the independence between each pair of components of m and each pair of components of z, respectively. Figure 1(b) and Figure 1(c) show that the disentanglement quality of the low-level features m affects the disentanglement quality of the resulting representation z. This study demonstrates the potential benefit of regularizing the compositional feature space of deep models during disentangled representation learning.

Figure 1: Illustration of the negative impact of ignoring the compositional structure of the representation space, using the dSprites dataset: (a) illustration of z (from the embedding space) and m (from intermediate layers); (b) z is not disentangled sufficiently when m is not disentangled sufficiently; (c) the disentanglement of z improves as that of m improves.

Author affiliations: China and Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University, Shanghai, China; Microsoft Research Asia, Shanghai, China; Department of Computing, Imperial College London, London, United Kingdom; Department of Electrical Engineering and Computer Science, University of Michigan, Michigan, United States; Department of Computer Science, University of Colorado Boulder, Boulder, United States; School of Microelectronics, Fudan University, Shanghai, China. ** The corresponding author.

This work aims to tackle the compositional disentanglement learning problem. First, we formulate disentangled representation learning from an information-theoretic perspective and introduce a new learning objective covering three essential properties for learning disentangled representations: sufficiency, minimal sufficiency, and disentanglement. Theoretical analysis shows that the proposed learning objective is a general form of β-VAE and several of its state-of-the-art variants. Next, we extend the proposed learning objective to cover the disentangled representation learning problem in the compositional feature space. Governed by the proposed learning objective, we present the Recursive Disentanglement Network (RecurD), a compositional disentanglement learning method that directs the disentanglement learning process across the compositional feature space by applying regulatory inductive bias recursively through the feed-forward network. We argue that the recursive propagation of inductive bias through the feed-forward network imposes a sufficient condition for disentangled representation learning. Empirical studies demonstrate that RecurD outperforms β-VAE (Higgins et al., 2017) and several other VAE variants (Burgess et al., 2018; Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2017) on disentangled representation learning and achieves more data-efficient learning in downstream machine learning tasks.

2 COMPOSITIONAL DISENTANGLEMENT LEARNING

In this section, we first formulate disentanglement learning from an information-theoretic perspective by introducing three key properties and show that this formulation is a general form of the optimization objectives of β-VAE and several of its variants. Next, we extend the principled objective to the compositional feature space to tackle the compositional disentanglement learning problem.
2.1 DISENTANGLEMENT LEARNING OBJECTIVE

The challenge of representation learning can be formulated as finding a distribution p(z|x) that maps original data x ∈ X into a representation z with a fixed number of variables z = {z_1, ..., z_n} (Bengio et al., 2013). The key intuition is that z should capture minimal sufficient information in a disentangled manner, given the reconstruction task x → x̂. We denote the representation learning process by the Markov chain $\hat{x} \leftrightarrow x \leftrightarrow z$, which means that z depends on x̂ only through x, i.e., p(z|x) = p(z|x, x̂) (see also Cover, 1999; Achille & Soatto, 2018). The principled properties of z are defined as follows.

Definition 1. Sufficiency: a representation z of x for x̂ is sufficient if I(x; x̂) = I(z; x̂).

(The independence between two components c_i and c_j is measured by the normalized mutual information (Chen et al., 2018): whenever NMI(c_i; c_j) = I(c_i; c_j)/H(c) = 0, c_i and c_j are independent, i.e., disentangled.)

For the reconstruction task, z is sufficient if x can be successfully reconstructed as x̂ from z. The difference between I(x; x̂) and I(z; x̂) is computed as

$$I(x; \hat{x}) - I(z; \hat{x}) = I(x; \hat{x} \mid z) = H(\hat{x} \mid z) - H(\hat{x} \mid x).$$

Given the reconstruction task x → x̂, H(x̂|x) is constant and independent of z, so the sufficiency property can be optimized by minimizing H(x̂|z) (Federici et al., 2020; Dubois et al., 2020).

Definition 2. Minimal Sufficiency: a representation z of x is minimal sufficient if I(x; z) = I(z; x̂).

A minimal sufficient z encodes the minimum amount of information about x required to reconstruct x̂ (Cover, 1999; Achille & Soatto, 2018). Since I(z; x̂) equals I(x; x̂) when z is sufficient, the difference is computed as I(x; z) − I(z; x̂) = I(x; z) − I(x; x̂). Given the reconstruction task x → x̂, I(x; x̂) is constant and independent of z, so the minimal sufficiency property can be optimized by minimizing I(x; z).

Definition 3. Disentanglement: a representation z = {z_1, ..., z_n} is disentangled if $\sum_{j \neq i} I(z_i; z_j) = 0$.

From the definition of mutual information, I(z_i; z_j) = H(z_i) − H(z_i|z_j) denotes the reduction of uncertainty in z_i when z_j is observed (Cover, 1999). If any two components z_i and z_j are disentangled, changes to z_i have no influence on z_j, which means I(z_i; z_j) = 0.

A representation satisfying all these properties can be found by introducing two Lagrange multipliers λ1 and λ2 for the two constrained properties with respect to the fundamental sufficiency property. The principled objective of disentanglement learning is to minimize

$$\mathcal{L} = H(\hat{x} \mid z) + \lambda_1 I(x; z) + \lambda_2 \sum_{j \neq i} I(z_i; z_j). \tag{1}$$

The above objective can be interpreted as the reconstruction error plus two regularizers that yield an optimally disentangled representation. The principled objective also helps us analyze and understand the success of recently developed β-VAE-based methods. These methods operate with an encoder with parameters φ and a decoder with parameters θ, inducing the joint distributions q(x, z) = q_φ(z|x)q(x) and p(x, z) = p_θ(x|z)p(z), respectively, where p(z) is a fixed prior distribution. The learning objective of β-VAE contains the reconstruction error and the KL divergence between the variational posterior and the prior.
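To make Equation 1 concrete, the following sketch shows how its three terms could be assembled into a mini-batch training loss. It is an illustration only and assumes PyTorch and externally supplied sample-based mutual information estimators passed in as callables (e.g., MINE-style critics of the kind used later in Section 3.2); the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def principled_loss(x, x_hat, z, mi_estimator, pair_mi_estimator,
                    lambda1: float = 1.0, lambda2: float = 2.0):
    """Assemble the three terms of Eq. (1) on a mini-batch.

    - H(x_hat | z) is approximated by the reconstruction cross-entropy.
    - I(x; z) and I(z_i; z_j) are supplied by sample-based MI estimators.
    """
    # Sufficiency: reconstruction error as a proxy for H(x_hat | z).
    recon = F.binary_cross_entropy(x_hat, x, reduction="mean")

    # Minimal sufficiency: estimate of I(x; z) on the mini-batch.
    mi_xz = mi_estimator(x.flatten(1), z)

    # Disentanglement: sum of pairwise I(z_i; z_j) over latent dimensions.
    n = z.shape[1]
    pair_mi = sum(pair_mi_estimator(z[:, i:i + 1], z[:, j:j + 1])
                  for i in range(n) for j in range(n) if i != j)

    return recon + lambda1 * mi_xz + lambda2 * pair_mi
```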
To understand the relationship between the learning objectives of Equation 1 and β-VAE-based methods, we decompose I(x; z) (Kim & Mnih, 2018) and estimate an upper bound for $\sum_{j \neq i} I(z_i; z_j)$ (Te Sun, 1980; 1975), and then assign different weights as follows:

$$\lambda_1 I(x; z) + \lambda_2 \sum_{j \neq i} I(z_i; z_j) \;\le\; \lambda_a\, \mathbb{E}_x\!\left[\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)\right] + \lambda_b\, \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) + \lambda_c \sum_{j=1}^{n} \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big). \tag{2}$$

As shown in Table 1, the learning objectives of β-VAE and its four variants can be regarded as specific cases of Equation 1, i.e., they assign different weights to our regularization terms, which balance latent variable capacity, independence constraints, and reconstruction accuracy, leading to successful disentangled representation learning (Zhao et al., 2017; Li et al., 2020). More details can be found in Appendix B. However, in these works, the inductive bias toward disentanglement is only applied to the embedding space of z, ignoring the need for disentanglement during feature composition in feed-forward networks.

Table 1: Deriving the learning objectives of β-VAE and its four variants as specific cases of Equation 1.

| Method | β-VAE | FactorVAE | β-TCVAE | DIP-VAE | InfoVAE |
| --- | --- | --- | --- | --- | --- |
| Weight relation | λa = β | λa = 1, λb = γ | λa = 1, λb = β | λa = 1, λb = λc = λ | λa = 1 − α, λb + λc = α + λ − 1 |
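As a concrete example of the λa term in Equation 2, the sketch below computes the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior; β-VAE corresponds to scaling this term by β (Table 1). The function name and tensor layout are illustrative assumptions.

```python
import torch

def gaussian_kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || p(z)) for q(z|x) = N(mu, diag(exp(logvar))) and p(z) = N(0, I).

    This is the term weighted by lambda_a in Eq. (2); beta-VAE scales it by beta.
    """
    # Closed form per dimension: 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1).
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl_per_dim.sum(dim=1).mean()
```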
2.2 COMPOSITIONAL OBJECTIVE

Consider an encoder with L layers that encodes the original data x into a disentangled representation z. Let $m^l$ denote the input features of the l-th layer, which are divided into groups of features, i.e., $m^l = \bigcup_j m_j^l$, where $m_j^l$ is the j-th feature subset. We formulate the compositional relation between features of two consecutive layers as

$$m_j^{l+1} = \mathrm{Layer}\big(m^l \odot w_j^l\big),$$

where Layer can be any commonly used neural network layer, e.g., a convolution layer in computer vision tasks. The compositional relation is realized by a composition matrix $w^l \in \mathbb{R}^{d_l \times d_{l+1}}$, so that the features in $m^l$ are divided into $d_{l+1}$ groups after passing through all compositional vectors $w_j^l$. Note that $m_j^{l+1}$ is only related to the subset of features from $m^l$ selected by $w_j^l$.

Similar to Section 2.1, we assume that the learning process of $m^{l+1}$ depends on $m^l$ through the original data x, denoted by the Markov chain $m^l \leftrightarrow x \leftrightarrow m^{l+1}$, which can equivalently be written as $m^{l+1} \leftrightarrow x \leftrightarrow m^l$ according to the conditional independence implied by Markovity (Cover, 1999). Considering two output features $m_i^{l+1}, m_j^{l+1} \in m^{l+1}$ and their corresponding input feature subsets $m_i^l, m_j^l \subset m^l$, we define the key notions as follows.

Definition 4. Compositional Disentanglement: $m_i^l$ and $m_j^l$ are disentangled if $I(m_i^l; m_j^l) = 0$.

A disentangled representation of $m_i^l$ and $m_j^l$ may improve the disentanglement quality between $m_i^{l+1}$ and $m_j^{l+1}$. Similar to Definition 3, we can achieve compositional disentanglement by minimizing $I(m_i^l; m_j^l)$.

Definition 5. Compositional Minimal Sufficiency: assume that the learning process of $m_j^{l+1}$ is denoted by the Markov chain $m_j^{l+1} \leftrightarrow x \leftrightarrow (m_i^l, m_j^l)$. Given the original data x, an input feature set $m_j^l$ for the output feature $m_j^{l+1}$ is minimal sufficient if $I(x; m_j^{l+1}) = I(m_j^l; m_j^{l+1})$.

For the output feature $m_j^{l+1}$, the input feature set $m_j^l$ is sufficient, and another input feature set $m_i^l$ is superfluous, when $m_j^l$ is able to capture all information of $m_j^{l+1}$ as well as of the original data x.

Furthermore, according to the Data-Processing Inequality (DPI) (Cover, 1999; Achille & Soatto, 2018) applied to the Markov chain, the following inequality holds:

$$I\big(x; m_j^{l+1}\big) \;\le\; I\big(m_j^{l+1}; m_j^l\big) + I\big(m_j^{l+1}; m_i^l\big) - I\big(m_i^l; m_j^l\big), \tag{3}$$

where the difference between $I(x; m_j^{l+1})$ and $I(m_j^{l+1}; m_j^l)$ is bounded by the difference between $I(m_j^{l+1}; m_i^l)$ and $I(m_i^l; m_j^l)$. Therefore, matching $I(m_j^{l+1}; m_i^l)$ to $I(m_i^l; m_j^l)$ yields a minimal sufficient representation $m_j^l$ for $m_j^{l+1}$. Based on the definition of compositional disentanglement, we can optimize minimal sufficiency by forcing $I(m_j^{l+1}; m_i^l)$ to be 0. More details can be found in Appendix B.

To learn disentangled representations by effectively regularizing the compositional feature space, we augment the principled learning objective (Equation 1) with compositional regularizers. The compositional learning objective for disentangled representation learning is therefore defined as

$$\mathcal{L} = \underbrace{H\big(\hat{x} \mid m^{L+1}\big)}_{\text{sufficient}} \;+\; \lambda_1 \underbrace{\sum_{l}\sum_{j \neq i} I\big(m_i^l; m_j^{l+1}\big)}_{\text{minimal sufficient}} \;+\; \lambda_2 \underbrace{\sum_{l}\sum_{j \neq i} I\big(m_i^l; m_j^l\big)}_{\text{disentangled}}, \tag{4}$$

where $m^{L+1}$ denotes the final disentangled representation z. Our intuition is that disentanglement learning on the compositional feature space benefits the disentanglement learning of high-level representations.

3 RECURSIVE DISENTANGLEMENT NETWORK

We now describe a learning method designed to optimize the compositional disentanglement learning objective. This method, called Recursive Disentanglement Network (RecurD), propagates inductive bias (disentanglement) recursively across the compositional feature space.

3.1 MODEL ARCHITECTURE

As shown in Figure 2, RecurD contains an encoder and a decoder to learn the disentangled representation z of data x and to reconstruct x̂, where the encoder consists of multiple Recursive Modules and the decoder is a deconvolutional neural network (Zeiler & Fergus, 2014).

Figure 2: Recursive Disentanglement Network.

The first Recursive Module of the encoder is implemented as a multi-channel convolutional network (Lawrence et al., 1997) that encodes the original image x. Following the notation in Section 2.2, the output of the 1st Recursive Module is denoted $m^2$, which is also the input of the 2nd Recursive Module. The l-th (l ≥ 2) Recursive Module contains (1) a Router R, which learns a composition matrix $w^l$ from the input features $m^l$ to decompose $m^l$ into subsets; and (2) a Group-of-Encoders (GoE) layer consisting of several parallel encoders that produce the output features $m^{l+1}$.

Inspired by the gate of Mixture-of-Experts (Shazeer et al., 2017; Fedus et al., 2021), which selects parameters for each input, we design a Router with TopK selection to learn the compositional relation. In detail, the Router takes $m^l$ as input and computes the composition matrix $w^l$ by learning a similarity:

$$w^l = \mathrm{softmax}\big(\mathrm{TopK}\big(R(m^l), k\big)\big), \qquad R(m^l) = \mathrm{softmax}\big(m^l (m^l)^T\big) \odot v,$$

$$\mathrm{TopK}\big(R(m^l), k\big)_{ij} = \begin{cases} R(m^l)_{ij}, & \text{if } R(m^l)_{ij} \text{ is among the top } k \text{ elements of } R(m^l)_{\cdot j}, \\ -\infty, & \text{otherwise.} \end{cases}$$

Here, R(m) denotes the similarity matrix, v is a matrix learned by a linear layer, and ⊙ denotes the Hadamard product. TopK is the compositional strategy that determines whether input feature $m_i^l$ belongs to the input feature set of $m_j^{l+1}$, and k is a hyperparameter. The GoE layer consists of $d_{l+1}$ parallel encoders $\mathrm{Enc}_1, \ldots, \mathrm{Enc}_{d_{l+1}}$ that generate the output features $m^{l+1}$, where each encoder is implemented as a convolutional neural network with its own parameters:

$$m_j^{l+1} = \mathrm{Enc}_j\big(m^l \odot w_j^l\big). \tag{6}$$

Then, $m^{l+1}$ is obtained by concatenating the outputs of the encoders and is passed directly as the input of the (l+1)-th Recursive Module. The output of the L-th Recursive Module is the learned disentangled representation z. Finally, the decoder Dec takes z as input to produce the reconstruction x̂ = Dec(z).
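The following PyTorch sketch illustrates one Recursive Module (a Router with TopK selection followed by a Group-of-Encoders), under assumed tensor shapes: the l-th layer's features are treated as d_in feature vectors per sample, the learned matrix v is stood in for by a linear projection to d_out routing columns, and each group encoder is a small fully connected block. These choices are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveModule(nn.Module):
    def __init__(self, d_in: int, d_out: int, feat_dim: int, k: int):
        super().__init__()
        self.k = k
        # Stand-in for the learned matrix v: maps each input feature's d_in-dim
        # similarity pattern to d_out routing scores (an assumption).
        self.v = nn.Linear(d_in, d_out, bias=False)
        # Group-of-Encoders: one small encoder per output feature group.
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU()) for _ in range(d_out)]
        )

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        # m: (batch, d_in, feat_dim)
        sim = F.softmax(torch.bmm(m, m.transpose(1, 2)), dim=-1)    # (batch, d_in, d_in)
        scores = self.v(sim)                                        # (batch, d_in, d_out)
        # Column-wise TopK: keep the k largest scores per output group, mask the rest.
        topk_vals, _ = scores.topk(self.k, dim=1)
        mask = scores >= topk_vals[:, -1:, :]
        w = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=1)
        # Encoder j sees the input features weighted by its routing column
        # (aggregated here by a weighted sum for simplicity).
        outputs = [enc((m * w[:, :, j:j + 1]).sum(dim=1)) for j, enc in enumerate(self.encoders)]
        return torch.stack(outputs, dim=1)                          # (batch, d_out, feat_dim)
```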
3.2 LEARNING OF RECURD

As stated in Equation 4, the learning objective of RecurD governs the recursive disentanglement learning across the compositional feature space. As shown by related work, precisely estimating mutual information in high-dimensional spaces is important for accurately estimating the loss. Following recent progress, the mutual information between two representations can be estimated with any sample-based differentiable mutual information lower bound. Similar to Federici et al. (2020), Hjelm et al. (2019), and Wen et al. (2020), we use MINE estimators (Belghazi et al., 2018) to estimate the mutual information. This approach introduces an auxiliary parametric estimator network that is jointly optimized during training using re-parametrized samples from the posterior distribution.
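For concreteness, the sketch below shows a MINE-style estimator (Belghazi et al., 2018) of the kind referred to above: a small statistics network evaluated on joint and shuffled (marginal) mini-batch pairs yields a Donsker-Varadhan lower bound on mutual information. The network size and the batch-shuffling construction of marginal samples are standard choices, not necessarily the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MINEEstimator(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, hidden: int = 128):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """Return a lower bound on I(a; b) estimated from a mini-batch."""
        joint = self.critic(torch.cat([a, b], dim=1)).mean()
        # Samples from the product of marginals: pair a with a shuffled b.
        b_shuffled = b[torch.randperm(b.size(0))]
        marginal = torch.logsumexp(self.critic(torch.cat([a, b_shuffled], dim=1)), dim=0) \
                   - torch.log(torch.tensor(float(b.size(0))))
        return joint - marginal  # Donsker-Varadhan bound: E_p[T] - log E_q[e^T]
```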
4 RELATED WORK

β-VAE-based methods introduce various regularization terms and apply them directly to the resulting embedding space. Higgins et al. (2017) proposed β-VAE, which introduces a weight β on the KL divergence between the inferred distribution q_φ(z|x) and its prior, an isotropic unit Gaussian. Burgess et al. (2018) found that, with a limited channel capacity, the network specializes in the factor that contributes most to a small reconstruction error. They therefore proposed Annealed-VAE, an extension of β-VAE that gradually adds latent encoding capacity by constraining the KL divergence to a controllable value C. Kim & Mnih (2018) analyzed the disentanglement performance of β-VAE by breaking its regularization term into two components: the Total Correlation (TC) (Watanabe, 1960), which encourages the marginal distribution of representations to be factorial, and the mutual information I(x; z), which limits the amount of information about x stored in z. Based on this decomposition, they proposed FactorVAE, which relaxes the regularization of I(x; z) while directly penalizing the TC term. Similarly, Chen et al. (2018) proposed β-TCVAE, which further decomposes the penalty into the dependence among variables and the distance between each variable's posterior and prior. Since TC is intractable, both FactorVAE and β-TCVAE rely on approximations. Another TC-related approach is DIP-VAE (Kumar et al., 2017), whose designers argued that estimating TC requires additional parameters and suffers from vanishing gradients; DIP-VAE therefore optimizes the moment distance between the aggregated posterior and a factorized prior instead of estimating TC. The above methods improve over β-VAE by applying inductive biases directly to the high-level latent variable space. However, the feature space of deep models is compositional in nature. Existing β-VAE variants cannot effectively apply disentanglement regularization across such a compositional feature space, and therefore yield inferior disentanglement in representation learning. Our approach differs from prior work by taking a principled information-theoretic approach to formulating and analyzing the compositional disentangled feature structure.

Recent works on hierarchical VAEs introduce layer-wise disentanglement regularization to learn conditioning structures across multi-layer latent variables, such as VampPrior (Tomczak & Welling, 2018), Ladder VAEs (Sønderby et al., 2016), and NVAE (Vahdat & Kautz, 2020). In these hierarchical model structures, e.g., the cross-layer residual connections in NVAE and VampPrior, inter-layer regularization is less of a focus: the latent variables of a preceding layer serve as shared inputs to the next layer, which introduces information redundancy and hence impairs representation disentanglement. In contrast, the proposed compositional objective optimizes the statistical independence of inter-layer and intra-layer latent variables simultaneously, thereby minimizing the information redundancy of inter-layer information sharing and improving disentanglement quality.

5 EXPERIMENTS

This section presents quantitative and qualitative experiments that evaluate RecurD in terms of disentanglement quality and data efficiency on downstream tasks.

5.1 PERFORMANCE OF DISENTANGLEMENT LEARNING

We compare the disentanglement learning performance of RecurD with β-VAE and its state-of-the-art variants, including β-VAE (Higgins et al., 2017), Annealed-VAE (Burgess et al., 2018), FactorVAE (Kim & Mnih, 2018), β-TCVAE (Chen et al., 2018), and DIP-VAE (Kumar et al., 2017). More experimental results can be found in Appendix E.

Datasets: We consider two datasets in which each image is obtained by a deterministic function of ground-truth factors: dSprites (Matthey et al., 2017) and 3DShapes (Burgess & Kim, 2018). dSprites contains 737,280 binary 64×64 images of 2D shapes with 5 ground-truth factors: 3 shapes, 6 scales, 40 orientations, 32 x-positions, and 32 y-positions. 3DShapes contains 480,000 RGB 64×64×3 images of 3D shapes with 6 ground-truth factors: 4 shapes, 8 scales, 15 orientations, 10 floor hues, 10 wall hues, and 10 object hues. For all experiments, we use a 9:1 training-to-testing data ratio, following earlier work (Kumar et al., 2017; Locatello et al., 2019). More details on network architecture and hyperparameter settings are included in Appendix D.

Evaluation Metrics for Disentanglement: There is no standard metric for evaluating disentanglement (Zhou et al., 2021; Ridgeway & Mozer, 2018), and most existing metrics involve estimating a variable-factor matrix relating the factors of variation to the learned representations. In the experiments, we consider three widely used metrics: Separated Attribute Predictability (SAP) (Kumar et al., 2017), Mutual Information Gap (MIG) (Chen et al., 2018), and Disentanglement, Completeness, and Informativeness (DCI) (Eastwood & Williams, 2018). DCI contains three metrics for disentanglement (DCI-D), completeness (DCI-C), and informativeness (DCI-I), respectively. Due to space limitations, we present the results of DCI-D in the main text and report DCI-C and DCI-I in Appendix E. Overall, these metrics evaluate RecurD from complementary disentanglement measurements.
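As an illustration of the evaluation protocol, the sketch below computes the MIG score following Chen et al. (2018): latent codes are discretized, the mutual information between each latent dimension and each ground-truth factor is estimated, and the normalized gap between the two most informative latents is averaged over factors. The number of discretization bins is an assumption.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig_score(latents: np.ndarray, factors: np.ndarray, n_bins: int = 20) -> float:
    """latents: (N, n_latents) continuous codes; factors: (N, n_factors) discrete labels."""
    # Discretize each latent dimension into n_bins bins.
    digitized = np.stack(
        [np.digitize(z, np.histogram(z, bins=n_bins)[1][:-1]) for z in latents.T]
    )  # (n_latents, N)
    gaps = []
    for k in range(factors.shape[1]):
        v = factors[:, k]
        mis = np.array([mutual_info_score(v, zj) for zj in digitized])
        h_v = mutual_info_score(v, v)      # entropy of the factor, in nats
        top2 = np.sort(mis)[::-1][:2]
        gaps.append((top2[0] - top2[1]) / h_v)
    return float(np.mean(gaps))
```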
5.1.1 QUANTITATIVE RESULTS

For quantitative analysis, we conduct three sets of experiments: 1) evaluating the average disentanglement score under the three evaluation metrics; 2) measuring the trade-off among reconstruction, minimal sufficiency, and disentanglement on mini-batch samples; and 3) analyzing the influence of the compositional objective on disentanglement learning via the three properties.

Table 2 compares the reconstruction error and three widely used disentanglement metrics of RecurD and five baseline methods on the dSprites and 3DShapes datasets. Compared with all the baselines, RecurD achieves a much lower reconstruction error as well as higher SAP, MIG, and DCI scores in most cases. It is worth pointing out that the reconstruction error of β-VAE increases as β increases (stronger disentanglement regularization), which indicates that β-VAE sacrifices reconstruction quality for disentanglement. In contrast, RecurD performs better on both reconstruction and disentanglement, demonstrating the advantage of the proposed compositional learning objective and the recursive disentanglement network.

Table 2: Three widely used evaluation scores and reconstruction error on the test sets of dSprites and 3DShapes. Boldface in the original indicates the best result, i.e., the lowest reconstruction error or the highest disentanglement score.

| Method | dSprites Reconst. Error | dSprites SAP | dSprites MIG | dSprites DCI-D | 3DShapes Reconst. Error | 3DShapes SAP | 3DShapes MIG | 3DShapes DCI-D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| β-VAE (β = 4) | 0.0066 | 0.0284 | 0.2617 | 0.1191 | 0.0216 | 0.1463 | 0.2519 | 0.4682 |
| β-VAE (β = 16) | 0.0094 | 0.0242 | 0.2241 | 0.1820 | 0.0544 | 0.1488 | 0.2629 | 0.4528 |
| β-VAE (β = 60) | 0.0127 | 0.0445 | 0.1432 | 0.1291 | 0.0624 | 0.1041 | 0.2402 | 0.4317 |
| Annealed-VAE | 0.0171 | 0.0311 | 0.1177 | 0.1449 | 0.0811 | 0.0730 | 0.2217 | 0.4279 |
| Factor-VAE | 0.0228 | 0.0436 | 0.2594 | 0.1955 | 0.0800 | 0.1331 | 0.2630 | 0.4491 |
| β-TCVAE | 0.0162 | 0.0352 | 0.1585 | 0.1774 | 0.0312 | 0.0364 | 0.2070 | 0.4487 |
| DIP-VAE | 0.0213 | 0.0261 | 0.0731 | 0.1038 | 0.0213 | 0.2013 | 0.3108 | 0.4853 |
| RecurD | 0.0047 | 0.0502 | 0.2707 | 0.3841 | 0.0083 | 0.1979 | 0.3105 | 0.5804 |

Figure 3 shows scatter plots relating minimal sufficiency and disentanglement to reconstruction error (sufficiency), where each point represents a mini-batch of data. We compute the minimal sufficiency score as I(x; z) and encode it as the area of each point. For all baseline methods, we observe that smaller points are correlated with higher reconstruction errors and higher disentanglement scores. The reason is that β-VAE and its variants only place disentanglement regularization on the embedding space of z, which limits the capacity of the information channel (Higgins et al., 2017); features that are important for reconstruction but harmful for disentanglement may be lost during training, compromising reconstruction quality. RecurD, in contrast, enforces disentanglement during feature composition in the feed-forward network and induces effective information compression. As a result, compared to all the baselines, the representations learned by RecurD achieve better disentanglement and minimal sufficiency without degrading reconstruction quality.

Figure 3: Reconstruction error vs. disentanglement performance. Points located at the top left indicate better performance. The area of each point represents the minimal sufficiency score estimated by I(x; z); a smaller area indicates better performance.

Figure 4 shows the impact of the number of Recursive Modules on disentanglement learning. We evaluate three RecurD variants on the principled properties: RecurD_0, RecurD_1, and RecurD_2, with 1, 2, and 3 Recursive Modules, respectively, and report the values corresponding to the three terms of our compositional learning objective on the training sets. We observe that RecurD_1 and RecurD_2 perform much better than RecurD_0 on the optimization of minimal sufficiency and disentanglement.
In addition, during the early stage of training, RecurD_2 improves much faster than RecurD_1. This experiment demonstrates that the recursive propagation of inductive bias through the feed-forward network improves disentangled representation learning.

Figure 4: The performance of RecurD with a varying number of Recursive Modules on the principled properties.

5.1.2 QUALITATIVE RESULTS

For qualitative analysis, we present the latent traversals of RecurD on the two datasets in Figure 5, in which we vary a single variable learned by one encoder in the GoE while keeping all others fixed. For 3DShapes, the latent traversals show that RecurD successfully captures all six factors of variation. For dSprites, we observe that RecurD discovers x-position, y-position, and scale (continuous variables). More importantly, RecurD can, to some extent, discover shape and orientation (discrete variables), which many other methods have been shown to struggle with (Kumar et al., 2017; Kim & Mnih, 2018; Locatello et al., 2019). The reason is that β-VAE and its variants encourage independence among variables by controlling the information capacity of z and matching the posterior p(z|x) to an isotropic unit Gaussian; learning discrete variables, however, would require a discrete prior rather than a Gaussian (Kim & Mnih, 2018). RecurD, on the other hand, can model both discrete and continuous factors by directly enforcing the three principled properties over the entire compositional feature space, leading to stronger representation capability.

Figure 5: First row: original images. Second row: reconstructions. Remaining rows: reconstructions of latent traversals.
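A minimal sketch of the latent traversal procedure behind Figure 5 is given below: encode an image, sweep a single latent dimension over a range while keeping the others fixed, and decode each modified code. The `encoder`/`decoder` callables and the traversal range are illustrative assumptions.

```python
import torch

@torch.no_grad()
def latent_traversal(encoder, decoder, x, dim: int, steps: int = 10, span: float = 3.0):
    """Return `steps` reconstructions obtained by sweeping latent dimension `dim`."""
    z = encoder(x.unsqueeze(0))                 # (1, n_latents)
    values = torch.linspace(-span, span, steps)
    frames = []
    for v in values:
        z_mod = z.clone()
        z_mod[0, dim] = v                       # overwrite one variable, keep the rest fixed
        frames.append(decoder(z_mod))
    return torch.cat(frames, dim=0)             # (steps, C, H, W)
```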
5.1.3 ABLATION AND PARAMETER DEPENDENCE STUDY

In this section, we evaluate the impact of the hyperparameters in both the learning objective and the model architecture. First, we study the impact of the compositional learning objective with varying regularization coefficients (λ1 and λ2) for minimal sufficiency and disentanglement. Then, we evaluate the influence of the hyperparameter k, the group size in the GoE of the Recursive Module. Figure 6 (RecurD with varying λ1, λ2, and k) shows the scatter plots of the disentanglement score (SAP) against reconstruction error on the test sets of dSprites and 3DShapes. As shown in Figure 6, larger penalties on both minimal sufficiency and disentanglement yield higher disentanglement scores and lower reconstruction errors, demonstrating the importance of the minimal sufficiency and disentanglement terms in Equation 1.

Figure 6: Ablation study on λ1, λ2, and group size k. Note that RecurD with λ2 = 0 reduces to β-VAE with the compositional architecture.

Note that RecurD with λ2 = 0 reduces to a standard β-VAE with the compositional architecture (as I(x; z) is a lower bound of E_x[KL(q(z|x) ‖ p(z))]), which underperforms RecurD, indicating that an explicit penalty on disentanglement is important for disentangled representation learning. As for the group size k in the GoE, it is not surprising that as k increases, RecurD attains lower reconstruction errors at a slightly inferior disentanglement performance: a denser composition of the feature space helps the model maintain sufficient information but makes it more difficult to disentangle the latent variables. RecurD with k = 1 performs poorly on both reconstruction and disentanglement, indicating that over-emphasizing decomposition of the feature space may fail to preserve sufficient information. More results are presented in Appendix E.

5.2 PERFORMANCE ON DOWNSTREAM TASKS

Figure 7: Performance comparison of RecurD and the baselines on the standard classification task (left) and the domain generalization task (right). Each model is trained with varying ratios of training data. The dotted line represents the performance of a single-layer neural network.

In this section, we compare the performance of RecurD and the five baseline methods by measuring their data efficiency on two downstream tasks: a standard classification task on the MNIST dataset and a domain generalization task on the MNIST-Rotation dataset (Ghifary et al., 2015). MNIST-Rotation is a synthetic dataset consisting of 6 domains, each containing 1,000 images of the 10 digits randomly selected from the MNIST training set, rotated by 0°, 15°, 30°, 45°, 60°, and 75°, respectively. For both tasks, we train on different proportions of the training set (20%, 40%, 60%, 80%, and 100%) to evaluate the classification performance of RecurD and the baselines with different amounts of training data. For the domain generalization problem, we follow previous works (Li et al., 2017; Balaji et al., 2018; Du et al., 2020) with the same train-test split and the leave-one-domain-out strategy, i.e., the samples from one domain serve as the target domain for testing and the samples from the remaining domains serve as the source domains for training.

Figure 7 reports the average accuracy of the different methods on the same test set. All methods achieve decent data efficiency on both tasks, i.e., little performance degradation even with 20% of the training data, suggesting that disentangled representation learning can effectively capture information from the inputs. However, β-VAE and its variants do not achieve satisfactory classification accuracy on either task compared to a single-layer neural network. The reason is that β-VAE and its variants obtain disentangled representations by limiting the capacity of the information channel, so they tend to retain only the features that contribute most to disentanglement, possibly losing informative features that matter for classification. Compared to the baselines, RecurD achieves consistently better performance on both tasks, especially on the harder domain generalization task, because RecurD learns disentangled representations without sacrificing reconstruction performance. This confirms the hypothesis that compositional disentanglement learning yields better generalization and more data-efficient representations. We believe that informative disentangled representations emerge when the right balance is struck between preserving sufficient information and learning minimal sufficient information in a disentangled manner.
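The data-efficiency protocol of this section can be sketched as follows: freeze the learned representation, train a simple classifier on growing fractions of the training set, and record test accuracy. The use of scikit-learn's logistic regression as the downstream classifier is an illustrative assumption, not the paper's exact downstream model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def data_efficiency_curve(z_train, y_train, z_test, y_test,
                          ratios=(0.2, 0.4, 0.6, 0.8, 1.0), seed: int = 0):
    """z_*: frozen representations (N, d); y_*: labels. Returns test accuracy per ratio."""
    rng = np.random.default_rng(seed)
    accuracies = {}
    for r in ratios:
        n = max(1, int(r * len(z_train)))
        idx = rng.choice(len(z_train), size=n, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(z_train[idx], y_train[idx])
        accuracies[r] = clf.score(z_test, y_test)
    return accuracies
```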
6 CONCLUSION

This paper has described a solution to the compositional disentangled representation learning problem. We first presented a general information-theoretic formulation of disentangled representation learning and then extended it to the compositional feature space. We then described RecurD, a recursive disentanglement network that propagates regulatory inductive bias recursively across the compositional feature space. RecurD outperforms β-VAE and its state-of-the-art variants on disentangled representation learning and achieves more data-efficient learning in downstream tasks.

REFERENCES

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947-1980, 2018.

Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. MetaReg: Towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems, 31:998-1008, 2018.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning, pp. 531-540. PMLR, 2018.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Chris Burgess and Hyunjik Kim. 3D shapes dataset. https://github.com/deepmind/3dshapes-dataset/, 2018.

Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.

Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2180-2188, 2016.

Thomas M Cover. Elements of Information Theory. John Wiley & Sons, 1999.

Yingjun Du, Jun Xu, Huan Xiong, Qiang Qiu, Xiantong Zhen, Cees G. M. Snoek, and Ling Shao. Learning to learn with variational information bottleneck for domain generalization. In European Conference on Computer Vision, pp. 200-216. Springer, 2020.

Yann Dubois, Douwe Kiela, David J Schwab, and Ramakrishna Vedantam. Learning optimal representations with the decodable information bottleneck. arXiv preprint arXiv:2009.12789, 2020.

Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017, 2020.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.

Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. Advances in Neural Information Processing Systems, 25:611-619, 2012.

Aviv Gabbay and Yedid Hoshen. Demystifying inter-class disentanglement. arXiv preprint arXiv:1906.11796, 2019.

Aviv Gabbay and Yedid Hoshen. Scaling-up disentanglement for image translation. arXiv preprint arXiv:2103.14017, 2021.

Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2551-2559, 2015.
Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In 5th International Conference on Learning Representations, ICLR 2017, 2017.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. ICLR, 2019.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, pp. 2649-2658. PMLR, 2018.

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.

Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98-113, 1997.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542-5550, 2017.

Yanjun Li, Shujian Yu, Jose C Principe, Xiaolin Li, and Dapeng Wu. PRI-VAE: Principle-of-relevant-information variational autoencoders. arXiv preprint arXiv:2007.06503, 2020.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114-4124. PMLR, 2019.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. A sober look at the unsupervised learning of disentangled representations and their evaluation. arXiv preprint arXiv:2010.14766, 2020.

Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

William McGill. Multivariate information transmission. Transactions of the IRE Professional Group on Information Theory, 4(4):93-111, 1954.

Xingchao Peng, Zijun Huang, Ximeng Sun, and Kate Saenko. Domain agnostic learning with disentangled representations. In International Conference on Machine Learning, pp. 5102-5112. PMLR, 2019.

Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. Advances in Neural Information Processing Systems, 28:1252-1260, 2015.

Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. arXiv preprint arXiv:1802.05312, 2018.

Huajie Shao, Shuochao Yao, Dachun Sun, Aston Zhang, Shengzhong Liu, Dongxin Liu, Jun Wang, and Tarek Abdelzaher. ControlVAE: Controllable variational autoencoder. In International Conference on Machine Learning, pp. 8655-8664. PMLR, 2020.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. Advances in Neural Information Processing Systems, 29:3738-3746, 2016.

Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In International Conference on Machine Learning, pp. 6056-6065. PMLR, 2019.
Han Te Sun. Linear dependence structure of the entropy space. Information and Control, 29(4):337-368, 1975.

Han Te Sun. Multiple mutual informations and multiple interactions in frequency data. Information and Control, 46:26-45, 1980.

Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214-1223. PMLR, 2018.

Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898, 2020.

Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66-82, 1960.

Liangjian Wen, Yiji Zhou, Lirong He, Mingyuan Zhou, and Zenglin Xu. Mutual information gradient estimation for representation learning. arXiv preprint arXiv:2005.01123, 2020.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818-833. Springer, 2014.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.

Sharon Zhou, Eric Zelikman, Fred Lu, Andrew Y. Ng, Gunnar E. Carlsson, and Stefano Ermon. Evaluating the disentanglement of deep generative models through manifold topology. In International Conference on Learning Representations, 2021.

Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B Tenenbaum, and William T Freeman. Visual object networks: Image generation with disentangled 3D representation. arXiv preprint arXiv:1812.02725, 2018.

A PROPERTIES OF MARKOV CHAINS AND MUTUAL INFORMATION

This section lists the properties of Markov chains and mutual information used in this work.

Markov chain. Consider three random variables a, b, and c with joint distribution p(a, b, c). If the conditional distribution of c depends only on a and is conditionally independent of b, the variables form a Markov chain, denoted b → a → c. In particular, the joint distribution can be written as (Cover, 1999):

$$p(a, b, c) = p(a)\,p(b|a)\,p(c|a). \tag{7}$$

Markovity implies conditional independence between c and b when a is observed, because

$$p(c, b \mid a) = \frac{p(a, b, c)}{p(a)} = \frac{p(c, a)\,p(b|a)}{p(a)} = p(c|a)\,p(b|a). \tag{8}$$

By rewriting the conditional independence, the Markov chain b → a → c also implies c → a → b:

$$p(b, c \mid a) = \frac{p(a, b, c)}{p(a)} = \frac{p(b, a)\,p(c|a)}{p(a)} = p(c|a)\,p(b|a). \tag{9}$$

Mutual information.

1. Positivity: I(a; b) ≥ 0 and I(a; b|c) ≥ 0.

2. Chain rule: I(a; b, c) = I(a; b) + I(a; c|b).

3. I(a; b) = H(a) − H(a|b) = 0 if and only if a and b are independent, since

$$I(a; b) = H(a) - H(a \mid b) = \mathbb{E}\left[\log \frac{1}{p(a)}\right] - \mathbb{E}\left[\log \frac{1}{p(a \mid b)}\right] = \mathbb{E}\left[\log \frac{p(a \mid b)}{p(a)} \cdot \frac{p(b)}{p(b)}\right] = \mathbb{E}\left[\log \frac{p(a, b)}{p(a)\,p(b)}\right] = D_{\mathrm{KL}}\big(p(a, b)\,\|\,p(a)\,p(b)\big) \ge 0.$$

4. Data-processing inequality (DPI): if three random variables a, b, and c with joint distribution p(a, b, c) form a Markov chain b → a → c, then I(b; a) ≥ I(b; c). (Theorem 2.8.1 in Cover (1999).)
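The data-processing inequality above can be checked numerically on small discrete distributions. The sketch below builds a random Markov chain b → a → c from Dirichlet-sampled conditionals, following Eq. 7, and verifies I(b; a) ≥ I(b; c); it is purely illustrative.

```python
import numpy as np

def mutual_information(p_xy: np.ndarray) -> float:
    """I(X; Y) in nats from a joint probability table p_xy."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
na, nb, nc = 4, 3, 5
p_a = rng.dirichlet(np.ones(na))                    # p(a)
p_b_given_a = rng.dirichlet(np.ones(nb), size=na)   # p(b|a), each row sums to 1
p_c_given_a = rng.dirichlet(np.ones(nc), size=na)   # p(c|a), each row sums to 1

# Joint tables implied by p(a, b, c) = p(a) p(b|a) p(c|a).
p_ab = p_a[:, None] * p_b_given_a                                    # p(a, b)
p_bc = np.einsum("a,ab,ac->bc", p_a, p_b_given_a, p_c_given_a)       # p(b, c)

assert mutual_information(p_ab.T) >= mutual_information(p_bc) - 1e-12  # I(b;a) >= I(b;c)
```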
B LEARNING OBJECTIVES DECOMPOSITION

In this section, we decompose the proposed learning objective to analyze the relationship between Equation 1 and the objectives of β-VAE-based methods. Let p(x) denote the true distribution of the data, and let p_φ(z|x) and p_θ(x|z) denote the unknown distributions that we need to estimate, parametrized by an encoder with parameters φ and a decoder with parameters θ.

B.1 DECOMPOSITION OF MINIMAL SUFFICIENCY

Minimal sufficiency requires that z encode the minimum amount of information about x needed to reconstruct x̂, and it is optimized by minimizing I(x; z). Inspired by Chen et al. (2018), we can decompose the minimal sufficiency term, assuming a factorized Gaussian prior p(z), as

$$\begin{aligned}
I(x; z) &= \mathbb{E}_{q(x,z)}\left[\log \frac{q(x, z)}{q(z)\,p(x)}\right] = \mathbb{E}_{q(x,z)}\left[\log \frac{q(z|x)\,p(x)}{q(z)\,p(x)}\right] \\
&= \mathbb{E}_{q(x,z)}\big[\log q(z|x) - \log p(z)\big] - \mathbb{E}_{q(z)}\Big[\log \prod_j q(z_j) - \log \prod_j p(z_j)\Big] - \mathbb{E}_{q(z)}\Big[\log q(z) - \log \prod_j q(z_j)\Big] \\
&= \mathbb{E}_{p(x)}\left[\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)\right] - \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big) - \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big).
\end{aligned}$$

The first term is the KL divergence between the inferred posterior and the prior, usually used as the disentanglement penalty in β-VAEs. The second term is the dimension-wise KL divergence, which represents the distance between each latent dimension and the prior. The third term is the Total Correlation, which reflects the independence among variables.

B.2 ESTIMATION OF DISENTANGLEMENT

Disentanglement is defined as the independence between any two variables and is optimized by minimizing $\sum_{i \neq j} I(z_i; z_j)$, which is a lower bound of the Total Correlation. The concept of Total Correlation (TC) was described by McGill (1954) and formally formulated by Watanabe (1960) to evaluate the mutual independence of multivariate variables. For a set z of random variables z_1, ..., z_n, the TC of z is

$$C(z_1, \ldots, z_n) = \sum_{i=1}^{n} H(z_i) - H(z_1, z_2, \ldots, z_n), \tag{12}$$

where H(z_1, z_2, ..., z_n) denotes the joint entropy. Furthermore, according to Te Sun (1980; 1975), the relation between TC and mutual information can be described as

$$\sum_{i=1}^{n} H(z_i) - H(z) = \sum_{k=2}^{n} \sum_{i_1 < \cdots < i_k} H(z_{i_1}; \cdots; z_{i_k}), \tag{13}$$

where the general $H(z_{i_1}; \cdots; z_{i_k})$ is defined as Fano's multiple mutual information among variables; in the case of k = 2 it reduces to Shannon's mutual information, and in the case of k = 3 it coincides with McGill's mutual information. The formulation of TC can thus be expanded as

$$C(z_1, \ldots, z_n) = \sum_{j \neq i} I(z_i; z_j) + \sum_{k \neq j \neq i} H(z_i; z_j; z_k) + \cdots + H(z_1; \cdots; z_n). \tag{14}$$

In VAEs, to measure the dependence among multiple latent variables, TC is computed as $\mathrm{KL}\big(q(z)\,\|\,\prod_j q(z_j)\big)$. Therefore, the proposed disentanglement term $\sum_{i \neq j} I(z_i; z_j)$ is a lower bound of TC.
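Total correlation is tractable in the Gaussian case, which makes Eq. 12 easy to illustrate: for z ~ N(0, Σ), TC(z) = 0.5 (Σ_i log Σ_ii − log det Σ). The sketch below computes this closed form; it is only an illustration, since in the paper TC is approximated via KL(q(z) ‖ ∏_j q(z_j)).

```python
import numpy as np

def gaussian_total_correlation(cov: np.ndarray) -> float:
    """Total correlation (in nats) of a zero-mean Gaussian with covariance `cov`."""
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

# Independent dimensions have zero total correlation; correlated ones do not.
print(gaussian_total_correlation(np.eye(3)))                            # 0.0
print(gaussian_total_correlation(np.array([[1.0, 0.8], [0.8, 1.0]])))   # > 0
```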
B.3 RELATION WITH OTHER WORKS

After decomposing the minimal sufficiency term and bounding the disentanglement term by TC, we can combine the two regularizers as

$$\begin{aligned}
\lambda_1 I(x; z) + \lambda_2 \sum_{j \neq i} I(z_i; z_j)
&\le \lambda_1\, \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)\big] - \lambda_{12} \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big) - \lambda_{13}\, \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) + \lambda_2\, \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) \\
&= \lambda_a\, \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)\big] + \lambda_b\, \mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) + \lambda_c \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big),
\end{aligned} \tag{15}$$

where λ12 and λ13 denote the weights assigned to the decomposed dimension-wise KL and total correlation terms of I(x; z), respectively. Therefore, by assigning different weights to the decomposed terms, we can establish the relationship between the proposed information-theoretic objective and existing β-VAE variants, which provides insight into the capabilities and limitations of existing methods and further motivates the proposed work. Specifically, the regularizer of β-TCVAE is $\alpha I(x; z) + \beta\, \mathrm{KL}\big(q(z)\,\|\,\prod_j q(z_j)\big) + \gamma \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)$. If we decompose the I(x; z) term of β-TCVAE as $\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big) - \mathrm{KL}\big(q(z)\,\|\,\prod_j q(z_j)\big) - \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)$, the regularizer of β-TCVAE can be written as $\alpha\, \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big) + (\beta - \alpha)\, \mathrm{KL}\big(q(z)\,\|\,\prod_j q(z_j)\big) + (\gamma - \alpha) \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)$. Since α = γ = 1 in β-TCVAE, the last line of Equation 15 is equivalent to β-TCVAE when λc = 0.

C PROOF OF COMPOSITIONAL MINIMAL SUFFICIENCY

In this section, we prove the statements reported in Section 2.2 of the paper.

(P1) Data-processing inequality in a Markov chain: for the Markov chain b → a → c, I(b; a) ≥ I(b; c). (Theorem 2.8.1 in Cover (1999).)
(P2) Chain rule for mutual information: I(a; b, c) = I(a; b) + I(a; c|b). (Theorem 2.5.1 in Cover (1999).)
(P3) Decomposition of conditional mutual information. (Proposition B.1 in Federici et al. (2020).)

Conditional independence assumption: the Markov chain $m_j^{l+1} \leftrightarrow x \leftrightarrow (m_i^l, m_j^l)$ implies conditional independence between $m_j^{l+1}$ and $(m_i^l, m_j^l)$ when x is observed. Then

$$\begin{aligned}
I\big(x; m_j^{l+1}\big) &\overset{(P1)}{\le} I\big(m_j^{l+1}; m_i^l, m_j^l\big) \\
&\overset{(P2)}{=} I\big(m_j^{l+1}; m_j^l\big) + I\big(m_j^{l+1}; m_i^l \mid m_j^l\big) \\
&\overset{(P3)}{=} I\big(m_j^{l+1}; m_j^l\big) + I\big(m_j^{l+1}; m_i^l\big) - I\big(m_i^l; m_j^l\big).
\end{aligned}$$

D EXPERIMENTAL DETAILS

For all baselines, we use a convolutional neural network for the encoder and a deconvolutional neural network for the decoder. For Factor-VAE, we use a 6-layer multi-layer perceptron for the discriminator with leaky ReLU activations on each layer, and we set γ = 6.4. For β-VAE, we use β ∈ {4, 16, 60}. Annealed-VAE uses γ = 1000 with C linearly increased from 0.5 nats to 25.0 nats. β-TCVAE uses α = γ = 1 and β = 4. DIP-VAE is implemented with λ_od = 100 and λ_d = 10. For RecurD, RecurD_0 uses the same encoder/decoder architecture as all baselines, as shown in Table 3. RecurD_1 and RecurD_2 use the same decoder architecture as all baselines; the details of their encoders are shown in Table 4 and Table 5. Following previous work, we use the negative cross-entropy to compute the reconstruction error, which represents sufficiency in our work. For the computation of minimal sufficiency and disentanglement, we implement the MINE estimator as a 4-layer multi-layer perceptron, similar to Federici et al. (2020). For the hyperparameters of our model, we vary λ1 and λ2 over {0.1, 0.2, 0.5, 1, 2, 5, 10, 50} and otherwise fix λ1 = 1 and λ2 = 2 for both datasets. During training, we use the Adam optimizer with learning rate 1e-4, β1 = 0.9, and β2 = 0.999 for parameter updates. We use RecurD_1 on dSprites, 3DCars, and 3DShapes, and RecurD_2 on CelebA.

Table 3: Encoder and decoder architecture for all baselines and RecurD_0.

| Encoder | Decoder |
| --- | --- |
| Input: 64×64×1 (3) images for dSprites (3DShapes) | Input: z ∈ R^8 |
| Conv: k=4, s=2, p=1, channels=32, ReLU | FC: 128, ReLU |
| Conv: k=4, s=2, p=1, channels=32, ReLU | FC: 4×4×64, ReLU |
| Conv: k=4, s=2, p=1, channels=64, ReLU | Deconv: k=4, s=2, p=1, channels=64, ReLU |
| Conv: k=4, s=2, p=1, channels=64, ReLU | Deconv: k=4, s=2, p=1, channels=32, ReLU |
| FC: 128 (256) | Deconv: k=4, s=2, p=1, channels=32, ReLU |
| FC: 2×8 | Deconv: k=4, s=2, p=1, channels=1 (3), ReLU |
| | Output: 64×64×1 (3) reconstructed images |
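A PyTorch sketch of the Table 3 encoder/decoder is given below. The layer sizes follow the table (the FC width is 128 for dSprites and 256 for 3DShapes, and the final encoder FC outputs 2×8 values, read as the mean and log-variance of an 8-dimensional latent); any detail not listed in the table is an assumption.

```python
import torch.nn as nn

def make_encoder(in_channels: int = 1, fc_dim: int = 128, latent_dim: int = 8) -> nn.Sequential:
    # fc_dim = 128 for dSprites, 256 for 3DShapes (Table 3).
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
        nn.Conv2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),            # 32 -> 16
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),            # 16 -> 8
        nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),            # 8 -> 4
        nn.Flatten(),
        nn.Linear(64 * 4 * 4, fc_dim), nn.ReLU(),
        nn.Linear(fc_dim, 2 * latent_dim),                               # mean and log-variance
    )

def make_decoder(out_channels: int = 1, latent_dim: int = 8) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(latent_dim, 128), nn.ReLU(),
        nn.Linear(128, 4 * 4 * 64), nn.ReLU(),
        nn.Unflatten(1, (64, 4, 4)),
        nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),   # 4 -> 8
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
        nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
        nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 64
    )
```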
Table 4: Encoder architecture of RecurD_1.

| Stage | Router | GoE / layers |
| --- | --- | --- |
| Input | | 64×64×1 (3) image |
| 0-th Recursive Module | Router(v): – | GoE: 4 × [Conv: k=4, s=2, p=1, channels=32, ReLU]; Conv: k=4, s=2, p=1, channels=64, ReLU |
| 1-th Recursive Module | Router(v): FC, 8, ReLU | GoE: 8 × [Conv: k=4, s=2, p=1, channels=16, ReLU; FC, 16; FC, 2×1] |

Table 5: Encoder architecture of RecurD_2.

| Stage | Router | GoE / layers |
| --- | --- | --- |
| Input | | 64×64×1 (3) image |
| 0-th Recursive Module | Router(v): – | GoE: 2 × [Conv: k=4, s=2, p=1, channels=32, ReLU] |
| 1-th Recursive Module | Router(v): FC, 4, ReLU | GoE: 4 × [Conv: k=4, s=2, p=1, channels=16, ReLU] |
| 2-th Recursive Module | Router(v): FC, 16, ReLU | GoE: 16 × [Conv: k=4, s=2, p=1, channels=16, ReLU; FC, 16; FC, 2×1] |

E ADDITIONAL EXPERIMENTS

This section reports additional experiments, which lead to similar conclusions as those of Section 5.

E.1 QUANTITATIVE RESULTS

The first set of quantitative results supplements the main text with the DCI-C and DCI-I disentanglement scores on dSprites and 3DShapes. Table 6, Figure 8, and Figure 9 report these disentanglement scores together with the reconstruction error, and also show that RecurD achieves a better trade-off between reconstruction and disentanglement.

Figure 8: Reconstruction error vs. disentanglement performance on dSprites. Points located at the top left indicate better performance.

Figure 9: Reconstruction error vs. disentanglement performance on 3DShapes. Points located at the top left indicate better performance.

Table 6: DCI-C and DCI-I scores and reconstruction error on the test sets of dSprites and 3DShapes. Boldface in the original indicates the best result, i.e., the lowest reconstruction error or the highest disentanglement score.

| Method | dSprites Reconst. Error | dSprites DCI-C | dSprites DCI-I | 3DShapes Reconst. Error | 3DShapes DCI-C | 3DShapes DCI-I |
| --- | --- | --- | --- | --- | --- | --- |
| β-VAE (β = 4) | 0.0066 | 0.1148 | 0.2181 | 0.0216 | 0.4463 | 0.1982 |
| β-VAE (β = 16) | 0.0094 | 0.1575 | 0.2087 | 0.0544 | 0.4691 | 0.1828 |
| β-VAE (β = 60) | 0.0127 | 0.1261 | 0.1769 | 0.0624 | 0.4215 | 0.1641 |
| Annealed-VAE | 0.0171 | 0.1793 | 0.1663 | 0.0811 | 0.4419 | 0.1682 |
| Factor-VAE | 0.0228 | 0.1375 | 0.1699 | 0.0800 | 0.4419 | 0.1707 |
| β-TCVAE | 0.0162 | 0.1988 | 0.1660 | 0.0312 | 0.4516 | 0.1774 |
| DIP-VAE | 0.0213 | 0.0968 | 0.1496 | 0.0213 | 0.4621 | 0.1961 |
| RecurD | 0.0047 | 0.3835 | 0.2222 | 0.0083 | 0.5668 | 0.2014 |

The second set compares performance against three additional baselines, i.e., InfoGAN (Chen et al., 2016), Control-VAE, and NVAE. InfoGAN (Chen et al., 2016) maximizes the mutual information between a small subset of the latent variables and the observations to increase the interpretability of the latent representation. Control-VAE (Shao et al., 2020) dynamically tunes the weight β on the KL term to achieve a good trade-off between disentanglement and reconstruction quality. NVAE (Vahdat & Kautz, 2020) targets high-quality image generation by capturing global correlations across multi-layer latent variables. We compare with InfoGAN, Control-VAE, and NVAE on three datasets: dSprites, 3DShapes, and 3DCars. The 3DCars dataset (Reed et al., 2015) contains different car models from Fidler et al. (2012) under different camera viewpoints. The evaluation metric is the MIG disentanglement score. Table 7 shows that NVAE and VampPrior can achieve reconstruction errors comparable to RecurD. However, their disentanglement quality is not as good as that of RecurD, because RecurD additionally regularizes the inter-layer information sharing and alleviates the information redundancy across layers. This experiment demonstrates the superiority of the proposed RecurD method in disentanglement compared to existing hierarchical VAEs.
Table 7: MIG score and reconstruction error on the test sets of dSprites, 3DShapes, and 3DCars. Boldface in the original indicates the best result, i.e., the lowest reconstruction error or the highest disentanglement score.

| Method | dSprites Reconst. Error | dSprites MIG | 3DShapes Reconst. Error | 3DShapes MIG | 3DCars Reconst. Error | 3DCars MIG |
| --- | --- | --- | --- | --- | --- | --- |
| β-VAE (β = 4) | 0.0066 | 0.2617 | 0.0216 | 0.2519 | 0.0376 | 0.1015 |
| InfoGAN | – | 0.1598 | – | 0.1874 | – | 0.1083 |
| Control-VAE | 0.0102 | 0.2455 | 0.0357 | 0.2630 | 0.0257 | 0.1583 |
| NVAE | 0.0041 | 0.0043 | 0.0078 | 0.0081 | 0.0118 | 0.0034 |
| RecurD | 0.0047 | 0.2707 | 0.0083 | 0.3105 | 0.0132 | 0.1762 |

E.2 ABLATION STUDY

In this section, we report supplementary ablation results. The first set studies the impact of the compositional learning objective with varying regularization coefficients (λ1 and λ2) for minimal sufficiency and disentanglement. Figures 10 and 11 show the scatter plots of disentanglement scores against reconstruction error on the dSprites test set with varying λ1 and λ2; Figures 12 and 13 show the results on 3DShapes.

The second set evaluates the influence of the hyperparameter k, the group size in the GoE of the Recursive Module. Figures 14 and 15 show the influence of varying k on the disentanglement scores and reconstruction error on the dSprites test set; Figures 16 and 17 show the results on 3DShapes.

The third set evaluates the disentanglement metrics on the preceding layers. We compute MIG on $m^{L-1}$ and $m^{L-2}$ of RecurD on 3DShapes. As shown in Table 8, higher-level representations have higher MIG than lower-level representations (z > $m^{L-1}$ > $m^{L-2}$). These results confirm that the recursive propagation of inductive bias through the feed-forward network improves disentangled representation learning.

The last set evaluates the impact of the Group-of-Encoders (GoE) architecture. We compare three GoE variants: Linear, where GoE is implemented as a linear layer with softmax; Fix, where GoE uses a fixed assignment, i.e., $m^l$ is split into $d_{l+1}$ equal slices; and Att, where GoE is implemented as a multi-head attention layer (with the number of heads fixed at 8). As shown in Table 9, this study demonstrates that the proposed GoE fits well with the disentanglement process.

Figure 10: Ablation study on λ1 and λ2 on dSprites.

Figure 11: Ablation study on λ1 and λ2 on dSprites.

Table 8: Comparison of MIG on z, $m^{L-1}$, and $m^{L-2}$.

| | m^{L-2} | m^{L-1} | z (m^L) | Reconst. Error |
| --- | --- | --- | --- | --- |
| MIG | 0.1081 | 0.2892 | 0.3105 | 0.0083 |

Figure 12: Ablation study on λ1 and λ2 on 3DShapes.

Figure 13: Ablation study on λ1 and λ2 on 3DShapes.

Figure 14: Ablation study on k on dSprites.

Figure 15: Ablation study on k on dSprites.

Figure 16: Ablation study on k on 3DShapes.

Figure 17: Ablation study on k on 3DShapes.

Table 9: Ablation study on the GoE architecture.

| | Linear | Fix | Att | RecurD |
| --- | --- | --- | --- | --- |
| MIG | 0.2539 | 0.2617 | 0.2604 | 0.3105 |
| Reconst. Error | 0.0176 | 0.0116 | 0.0095 | 0.0083 |

E.3 QUALITATIVE RESULTS

In this section, we report qualitative samples of traversal images on three datasets: dSprites (Figure 18), 3DShapes (Figure 19), and CelebA (Figures 20 to 25). Specifically, on CelebA, we tentatively increase the dimensionality of the latent variables of RecurD (from 16 to 32) by doubling the output of the last layer of each encoder from a single latent variable to a 2-dimensional variable.

Figure 18: Traversal samples on dSprites.

Figure 19: Traversal samples on 3DShapes.
Figure 20: Traversal samples on CelebA (Azimuth).

Figure 21: Traversal samples on CelebA (Background Color).

Figure 22: Traversal samples on CelebA (Face Color).

Figure 23: Traversal samples on CelebA (Face Width).

Figure 24: Traversal samples on CelebA (Gender).

Figure 25: Traversal samples on CelebA (Smile).

E.4 EVALUATION OF COMPUTATIONAL COSTS

In this section, we evaluate the computational complexity of our model on 3DShapes and CelebA. The evaluation metrics are multiply-accumulate operations (MACs), model parameters (Params), evaluation time (Eva. time), and convergence time (Con. time; all models are trained until they converge to the same reconstruction error as β-VAE). In terms of parameter efficiency, as shown in Table 10, the recursive disentanglement network itself contains only 0.826 million parameters (RecurD_2 w/o MINEs), which is comparable to β-VAE. Most of the parameters of RecurD_2 are contributed by the MINE estimators. Specifically, in the initial implementation (RecurD_2 with specific MINEs), each pair of encoder outputs is equipped with a dedicated MINE model, which contributes 7.214 million parameters. We further optimize the design by sharing MINE models within the same feature category, which reduces the total number of parameters to 1.325 million (RecurD_2 with shared MINEs).

Table 10: Complexity comparison of three models on 3DShapes and CelebA (the first group of rows corresponds to 3DShapes and the second to CelebA).

| Dataset | Method | MACs (G) | Params (M) | Eva. time (s) | Con. time (s) |
| --- | --- | --- | --- | --- | --- |
| 3DShapes | RecurD_1 | 3.466 | 3.672 | 0.005124 | 27.1212 |
| 3DShapes | RecurD_2 | 3.469 | 3.694 | 0.005354 | 23.3197 |
| 3DShapes | β-VAE | 3.144 | 0.769 | 0.003283 | 32.5251 |
| 3DShapes | Factor-VAE | 3.401 | 4.779 | 0.003389 | 40.6231 |
| CelebA | RecurD_1 | 3.944 | 7.935 | 0.01065 | 29.7421 |
| CelebA | RecurD_2 | 3.957 | 8.040 | 0.01087 | 36.6408 |
| CelebA | RecurD_2 w/o MINEs | 2.960 | 0.826 | – | – |
| CelebA | RecurD_2 / shared MINEs | 3.654 | 1.325 | 0.01072 | 36.6402 |
| CelebA | β-VAE | 3.145 | 0.769 | 0.01030 | 45.2067 |
| CelebA | Factor-VAE | 3.402 | 4.792 | 0.01005 | 48.9939 |