# WEAKLY SUPERVISED DISENTANGLEMENT WITH GUARANTEES

Published as a conference paper at ICLR 2020

Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon & Ben Poole
Stanford University; Google Brain
{ruishu,cynnjjs,ermon}@stanford.edu, {abhishk,pooleb}@google.com
(Work done during an internship at Google Brain.)

Learning disentangled representations that correspond to factors of variation in real-world data is critical to interpretable and human-controllable machine learning. Recently, concerns about the viability of learning disentangled representations in a purely unsupervised manner have spurred a shift toward the incorporation of weak supervision. However, there is currently no formalism that identifies when and how weak supervision will guarantee disentanglement. To address this issue, we provide a theoretical framework to assist in analyzing the disentanglement guarantees (or lack thereof) conferred by weak supervision when coupled with learning algorithms based on distribution matching. We empirically verify the guarantees and limitations of several weak supervision methods (restricted labeling, match pairing, and rank pairing), demonstrating the predictive power and usefulness of our theoretical framework.

1 INTRODUCTION

Many real-world datasets can be intuitively described via a data-generating process that first samples an underlying set of interpretable factors, and then, conditional on those factors, generates an observed data point. For example, in image generation, one might first sample the object identity and pose, and then render an image with the object in the correct pose. The goal of disentangled representation learning is to learn a representation where each dimension of the representation corresponds to a distinct factor of variation in the dataset (Bengio et al., 2013). Learning such representations that align with the underlying factors of variation may be critical to the development of machine learning models that are explainable or human-controllable (Gilpin et al., 2018; Lee et al., 2019; Klys et al., 2018).

In recent years, disentanglement research has focused on learning such representations in an unsupervised fashion, using only independent samples from the data distribution without access to the true factors of variation (Higgins et al., 2017; Chen et al., 2018a; Kim & Mnih, 2018; Esmaeili et al., 2018). However, Locatello et al. (2019) demonstrated that many existing methods for the unsupervised learning of disentangled representations are brittle, requiring careful supervision-based hyperparameter tuning. To build robust disentangled representation learning methods that do not require large amounts of supervised data, recent work has turned to forms of weak supervision (Chen & Batmanghelich, 2019; Gabbay & Hoshen, 2019). Weak supervision can allow one to build models that have interpretable representations even when human labeling is challenging (e.g., hair style in face generation, or style in music generation). While existing methods based on weakly supervised learning demonstrate empirical gains, there is no existing formalism for describing the theoretical guarantees conferred by different forms of weak supervision (Kulkarni et al., 2015; Reed et al., 2015; Bouchacourt et al., 2018). In this paper, we present a comprehensive theoretical framework for weakly supervised disentanglement, and evaluate our framework on several datasets. Our contributions are several-fold.
1. We formalize weakly-supervised learning as distribution matching in an extended space.
2. We propose a set of definitions for disentanglement that can handle correlated factors and are inspired by many existing definitions in the literature (Higgins et al., 2018; Suter et al., 2018; Ridgeway & Mozer, 2018).
3. Using these definitions, we provide a conceptually useful and theoretically rigorous calculus of disentanglement.
4. We apply our theoretical framework of disentanglement to analyze three notable classes of weak supervision methods (restricted labeling, match pairing, and rank pairing). We show that although certain weak supervision methods (e.g., style-labeling in style-content disentanglement) do not guarantee disentanglement, our calculus can determine whether disentanglement is guaranteed when multiple sources of weak supervision are combined.
5. Finally, we perform extensive experiments to systematically and empirically verify our predicted guarantees. (Code is available at https://github.com/google-research/google-research/tree/master/weak_disentangle.)

2 FROM UNSUPERVISED TO WEAKLY SUPERVISED DISTRIBUTION MATCHING

Our goal in disentangled representation learning is to identify a latent-variable generative model whose latent variables correspond to ground truth factors of variation in the data. To identify the role that weak supervision plays in providing guarantees on disentanglement, we first formalize the model families we are considering, the forms of weak supervision, and finally the metrics we will use to evaluate and prove components of disentanglement.

We consider data-generating processes where $S \in \mathbb{R}^n$ are the factors of variation, with distribution $p^*(s)$, and $X \in \mathbb{R}^m$ is the observed data point, which is a deterministic function of $S$, i.e., $X = g^*(S)$. Many existing algorithms for unsupervised learning of disentangled representations aim to learn a latent-variable model with prior $p(z)$ and generator $g$ such that $g(Z) \overset{d}{=} g^*(S)$. However, simply matching the marginal distribution over data is not enough: the learned latent variables $Z$ and the true generating factors $S$ could still be entangled with each other (Locatello et al., 2019). To address the failures of unsupervised learning of disentangled representations, we leverage weak supervision, where information about the data-generating process is conveyed through additional observations. By performing distribution matching on an augmented space (instead of just on the observation $X$), we can provide guarantees on learned representations.

Figure 1: Augmented data distributions derived from weak supervision (panels: restricted labeling, match pairing, rank pairing, unsupervised). Shaded nodes denote observed quantities, and unshaded nodes represent unobserved (latent) variables.

We consider three practical forms of weak supervision: restricted labeling, match pairing, and rank pairing. All of these forms of supervision can be thought of as augmented forms of the original joint distribution, where we partition the latent variables into two groups, $S = (S_I, S_{\setminus I})$, and either observe a subset of the latent variables or share latents between multiple samples. A visualization of these augmented distributions is presented in Figure 1, and below we detail each form of weak supervision.

In restricted labeling, we observe a subset of the ground-truth factors $S_I$ in addition to $X$. This allows us to perform distribution matching on $p^*(s_I, x)$, the joint distribution over data and observed factors, instead of just the data, $p^*(x)$, as in unsupervised learning.
This form of supervision is often leveraged in style-content disentanglement, where labels are available for content but not style (Kingma et al., 2014; Narayanaswamy et al., 2017; Chen et al., 2018b; Gabbay & Hoshen, 2019).

Match pairing uses paired data $(x, x')$ that share values for a known subset of factors $I$. For many data modalities, factors of variation may be difficult to label explicitly. Instead, it may be easier to collect pairs of samples that share the same underlying factor (e.g., collecting pairs of images of different people wearing the same glasses is easier than defining labels for style of glasses). Match pairing is a weaker form of supervision than restricted labeling, since the learning algorithm no longer depends on the underlying value $s_I$, only on the indices $I$ of the shared factors. Several variants of match pairing have appeared in the literature (Kulkarni et al., 2015; Bouchacourt et al., 2018; Ridgeway & Mozer, 2018), but they typically focus on groups of observations, in contrast to the paired setting we consider in this paper.

Rank pairing is another form of paired data generation where the pairs $(x, x')$ are generated in an i.i.d. fashion, and an additional indicator variable $y$ is observed that indicates whether the corresponding latent $s_i$ is greater than $s'_i$: $y = \mathbb{1}\{s_i \ge s'_i\}$. Such a form of supervision is effective when it is easier to compare two samples with respect to an underlying factor than to directly collect labels (e.g., comparing two object sizes versus providing a ruler measurement of an object). Although supervision via ranking features prominently in the metric learning literature (McFee & Lanckriet, 2010; Wang et al., 2014), our focus in this paper is on rank pairing in the context of disentanglement guarantees.

For each form of weak supervision, we can train generative models with the same structure as in Figure 1, using data sampled from the ground truth model and a distribution-matching objective. For example, for match pairing, we train a generative model $(p(z), g)$ such that the paired random variable $(g(Z_I, Z_{\setminus I}),\, g(Z_I, Z'_{\setminus I}))$ from the generator matches the distribution of the corresponding paired random variable $(g^*(S_I, S_{\setminus I}),\, g^*(S_I, S'_{\setminus I}))$ from the augmented data distribution. A sketch of how these three augmented data distributions can be sampled from a ground-truth process is given below.
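To make the three augmented distributions concrete, the following sketch samples restricted-labeling, match-pairing, and rank-pairing data from a toy ground-truth process. The factor dimensionality, the random linear map standing in for $g^*$, and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 3
W = rng.normal(size=(n_factors, 8))      # toy "pixels" for the oracle generator

def sample_factors(batch):
    # Toy prior p*(s); the true prior need not be factorized.
    return rng.normal(size=(batch, n_factors))

def g_star(s):
    # Toy deterministic oracle generator g*.
    return s @ W

def restricted_labeling(batch, I):
    # Observe x together with the labeled subset of factors s_I.
    s = sample_factors(batch)
    return g_star(s), s[:, I]

def match_pairing(batch, I):
    # Two samples share the factors indexed by I; all other factors are redrawn.
    s, s2 = sample_factors(batch), sample_factors(batch)
    s2[:, I] = s[:, I]
    return g_star(s), g_star(s2)

def rank_pairing(batch, i):
    # Two i.i.d. samples plus a binary comparison of factor i.
    s, s2 = sample_factors(batch), sample_factors(batch)
    y = (s[:, i] >= s2[:, i]).astype(np.float32)
    return g_star(s), g_star(s2), y

x, s_I = restricted_labeling(4, I=[0])
x1, x2 = match_pairing(4, I=[0, 2])
x1, x2, y = rank_pairing(4, i=1)
```

Note that restricted labeling exposes the factor values $s_I$ themselves, while match pairing exposes only the index set $I$ and rank pairing exposes only a binary comparison.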
3 DEFINING DISENTANGLEMENT

To identify the role that weak supervision plays in providing guarantees on disentanglement, we introduce a set of definitions that are consistent with our intuitions about what constitutes disentanglement and amenable to theoretical analysis. Our new definitions decompose disentanglement into two distinct concepts: consistency and restrictiveness. Different forms of weak supervision can enable consistency or restrictiveness on subsets of factors, and in Section 4 we build up a calculus of disentanglement from these primitives. We discuss the relationship to prior definitions of disentanglement in Appendix A.

3.1 DECOMPOSING DISENTANGLEMENT INTO CONSISTENCY AND RESTRICTIVENESS

Figure 2: Illustration of (a) disentanglement, (b) consistency, and (c) restrictiveness of $z_1$ with respect to the factor of variation size. Each image of a shape represents the decoding $g(z_{1:3})$ by the generative model. Each column denotes a fixed choice of $z_1$. Each row denotes a fixed choice of $(z_2, z_3)$.

A demonstration of consistency versus restrictiveness on models from disentanglement_lib is available in Appendix B. To ground our discussion of disentanglement, we consider an oracle that generates shapes with factors of variation for size ($S_1$), shape ($S_2$), and color ($S_3$). How can we determine whether $Z_1$ of our generative model disentangles the concept of size? Intuitively, one way to check whether $Z_1$ of the generative model disentangles size ($S_1$) is to visually inspect what happens as we vary $Z_1$, $Z_2$, and $Z_3$, and see whether the resulting visualizations are consistent with Figure 2a. In doing so, our visual inspection checks for two properties:

1. When $Z_1$ is fixed, the size ($S_1$) of the generated object never changes.
2. When only $Z_1$ is changed, the change is restricted to the size ($S_1$) of the generated object, meaning that there is no change in $S_j$ for $j \ne 1$.

We argue that disentanglement decomposes into these two properties, which we refer to as generator consistency and generator restrictiveness. Next, we formalize these two properties.

Let $\mathcal{H}$ be a hypothesis class of generative models from which we assume the true data-generating function is drawn. Each element of the hypothesis class $\mathcal{H}$ is a tuple $(p(s), g, e)$, where $p(s)$ describes the distribution over factors of variation, the generator $g$ is a function that maps from the factor space $\mathcal{S} \subseteq \mathbb{R}^n$ to the observation space $\mathcal{X} \subseteq \mathbb{R}^m$, and the encoder $e$ is a function that maps from $\mathcal{X}$ to $\mathcal{S}$. $S$ and $X$ can consist of both discrete and continuous random variables. We impose a few mild assumptions on $\mathcal{H}$ (see Appendix I.1). Notably, we assume every factor of variation is exactly recoverable from the observation $X$, i.e., $e(g(S)) = S$.

Given an oracle model $h^* = (p^*, g^*, e^*) \in \mathcal{H}$, we would like to learn a model $h = (p, g, e) \in \mathcal{H}$ whose latent variables disentangle the latent variables in $h^*$. We refer to the latent variables in the oracle $h^*$ as $S$ and the alternative model $h$'s latent variables as $Z$. If we further restrict $h$ to only those models where $g(Z) \overset{d}{=} g^*(S)$ are equal in distribution, it is natural to align $Z$ and $S$ via $S = e^* \circ g(Z)$. Under this relation between $Z$ and $S$, our goal is to construct definitions that describe whether the latent code $Z_i$ disentangles the corresponding factor $S_i$.

Generator Consistency. Let $I$ denote a set of indices and $p_I$ denote the generating process

$$z_I \sim p(z_I), \qquad (1)$$
$$z_{\setminus I},\, z'_{\setminus I} \overset{\text{iid}}{\sim} p(z_{\setminus I} \mid z_I). \qquad (2)$$

This generating process samples $Z_I$ once and then conditionally samples $Z_{\setminus I}$ twice in an i.i.d. fashion. We say that $Z_I$ is consistent with $S_I$ if

$$\mathbb{E}_{p_I}\left\| e^*_I \circ g(z_I, z_{\setminus I}) - e^*_I \circ g(z_I, z'_{\setminus I}) \right\|^2 = 0, \qquad (3)$$

where $e^*_I$ is the oracle encoder restricted to the indices $I$. Intuitively, Equation (3) states that, for any fixed choice of $Z_I$, resampling of $Z_{\setminus I}$ will not influence the oracle's measurement of the factors $S_I$. In other words, $S_I$ is invariant to changes in $Z_{\setminus I}$. An illustration of a generative model where $Z_1$ is consistent with size ($S_1$) is provided in Figure 2b. A notable property of our definition is that the prescribed sampling process $p_I$ does not require the underlying factors of variation to be statistically independent. We characterize this property in contrast to previous definitions of disentanglement in Appendix A. A sketch of a Monte Carlo check of this criterion on a toy model is given below.
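As a concrete illustration of Equation (3), the following sketch estimates the consistency criterion by Monte Carlo on a toy linear model. The linear map standing in for a learned generator $g$, its pseudo-inverse standing in for the oracle encoder $e^*$, and the factorized prior are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, x_dim = 3, 8
W = rng.normal(size=(n_factors, x_dim))
W_pinv = np.linalg.pinv(W)
g = lambda z: z @ W                  # stand-in learned generator g
e_star = lambda x: x @ W_pinv        # stand-in oracle encoder e*

def generator_consistency(I, n_samples=10000):
    """Monte Carlo estimate of the left-hand side of Equation (3)."""
    I = np.asarray(I)
    not_I = np.setdiff1d(np.arange(n_factors), I)
    # With a factorized prior, conditional resampling of the complement is a fresh draw.
    z = rng.normal(size=(n_samples, n_factors))
    z_prime = z.copy()
    z_prime[:, not_I] = rng.normal(size=(n_samples, len(not_I)))
    diff = e_star(g(z))[:, I] - e_star(g(z_prime))[:, I]
    return np.mean(np.sum(diff ** 2, axis=1))

print(generator_consistency(I=[0]))  # ~0: Z_0 is consistent with S_0 in this toy model
```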
Generator Restrictiveness. Let $p_{\setminus I}$ denote the generating process

$$z_{\setminus I} \sim p(z_{\setminus I}), \qquad (4)$$
$$z_I,\, z'_I \overset{\text{iid}}{\sim} p(z_I \mid z_{\setminus I}). \qquad (5)$$

We say that $Z_I$ is restricted to $S_I$ if

$$\mathbb{E}_{p_{\setminus I}}\left\| e^*_{\setminus I} \circ g(z_I, z_{\setminus I}) - e^*_{\setminus I} \circ g(z'_I, z_{\setminus I}) \right\|^2 = 0. \qquad (6)$$

Equation (6) states that, for any fixed choice of $Z_{\setminus I}$, resampling of $Z_I$ will not influence the oracle's measurement of the factors $S_{\setminus I}$. In other words, $S_{\setminus I}$ is invariant to changes in $Z_I$. Thus, changing $Z_I$ is restricted to modifying only $S_I$. An illustration of a generative model where $Z_1$ is restricted to size ($S_1$) is provided in Figure 2c.

Generator Disentanglement. We now say that $Z_I$ disentangles $S_I$ if $Z_I$ is consistent with and restricted to $S_I$. If we denote consistency and restrictiveness via Boolean functions $C(I)$ and $R(I)$, we can concisely state that

$$D(I) := C(I) \wedge R(I), \qquad (7)$$

where $D(I)$ denotes whether $Z_I$ disentangles $S_I$. An illustration of a generative model where $Z_1$ disentangles size ($S_1$) is provided in Figure 2a. Note that while size increases monotonically with $Z_1$ in the schematic figure, we wish to clarify that monotonicity is unrelated to the concepts of consistency and restrictiveness.

3.2 RELATION TO BIJECTIVITY-BASED DEFINITION OF DISENTANGLEMENT

Under our mild assumptions on $\mathcal{H}$, distribution matching $g(Z) \overset{d}{=} g^*(S)$ combined with generator disentanglement on factor $I$ implies the existence of two invertible functions $f_I$ and $f_{\setminus I}$ such that the alignment via $S = e^* \circ g(Z)$ decomposes into

$$\begin{pmatrix} S_I \\ S_{\setminus I} \end{pmatrix} = \begin{pmatrix} f_I(Z_I) \\ f_{\setminus I}(Z_{\setminus I}) \end{pmatrix}. \qquad (8)$$

This expression highlights the connection between disentanglement and invariance, whereby $S_I$ is only influenced by $Z_I$, and $S_{\setminus I}$ is only influenced by $Z_{\setminus I}$. However, such a bijectivity-based definition of disentanglement does not naturally expose the underlying primitives of consistency and restrictiveness, which we shall demonstrate in our theory and experiments to be valuable concepts for describing disentanglement guarantees under weak supervision.

3.3 ENCODER-BASED DEFINITIONS FOR DISENTANGLEMENT

Our proposed definitions are asymmetric, measuring the behavior of a generative model against an oracle encoder. So far, we have presented the definitions from the perspective of a learned generator $(p, g)$ measured against an oracle encoder $e^*$; in this sense, they are generator-based definitions. We can also develop a parallel set of definitions for encoder-based consistency, restrictiveness, and disentanglement within our framework simply by using an oracle generator $(p^*, g^*)$ measured against a learned encoder $e$. Below, we present the encoder-based perspective on consistency.

Encoder Consistency. Let $p^*_I$ denote the generating process

$$s_I \sim p^*(s_I), \qquad (9)$$
$$s_{\setminus I},\, s'_{\setminus I} \overset{\text{iid}}{\sim} p^*(s_{\setminus I} \mid s_I). \qquad (10)$$

This generating process samples $S_I$ once and then conditionally samples $S_{\setminus I}$ twice in an i.i.d. fashion. We say that $S_I$ is consistent with $Z_I$ if

$$\mathbb{E}_{p^*_I}\left\| e_I \circ g^*(s_I, s_{\setminus I}) - e_I \circ g^*(s_I, s'_{\setminus I}) \right\|^2 = 0. \qquad (11)$$

We now make two important observations. First, a valuable trait of our encoder-based definitions is that one can check for encoder consistency / restrictiveness / disentanglement as long as one has access to match-pairing data from the oracle generator. This is in contrast to existing disentanglement definitions and metrics, which require access to the ground truth factors (Higgins et al., 2017; Kumar et al., 2018; Kim & Mnih, 2018; Chen et al., 2018a; Suter et al., 2018; Ridgeway & Mozer, 2018; Eastwood & Williams, 2018). The ability to check for our definitions in a weakly supervised fashion is the key to why we can develop a theoretical framework using the language of consistency and restrictiveness.
Second, encoder-based definitions are tractable to measure when testing on synthetic data, since the synthetic data directly serves the role of the oracle generator. As such, while we develop our theory to guarantee both generator-based and encoder-based disentanglement, all of our measurements in the experiments are conducted with respect to a learned encoder.

We make three remarks on notation. First, $D(i) := D(\{i\})$. Second, $D(\emptyset)$ evaluates to true. Finally, $D(I)$ is implicitly dependent on either $(p, g, e^*)$ (generator-based) or $(p^*, g^*, e)$ (encoder-based). Where important, we shall make this dependency explicit (e.g., let $D(I; p, g, e^*)$ denote generator-based disentanglement). We apply these conventions to $C$ and $R$ analogously.

4 A CALCULUS OF DISENTANGLEMENT

There are several interesting relationships between restrictiveness and consistency. First, by definition, $C(I)$ is equivalent to $R(\setminus I)$. Second, we can see from Figures 2b and 2c that $C(I)$ and $R(I)$ do not imply each other. Based on these observations, and given that consistency and restrictiveness operate over subsets of the random variables, a natural question is whether consistency or restrictiveness over certain sets of variables implies additional properties over other sets of variables.

We develop a calculus for discovering implied relationships between learned latent variables $Z$ and ground truth factors of variation $S$ given known relationships, as follows.

Calculus of Disentanglement

Consistency and Restrictiveness:
- $C(I) \not\Rightarrow R(I)$
- $R(I) \not\Rightarrow C(I)$
- $C(I) \Leftrightarrow R(\setminus I)$

Union Rules:
- $C(I) \wedge C(J) \Rightarrow C(I \cup J)$
- $R(I) \wedge R(J) \Rightarrow R(I \cup J)$

Intersection Rules:
- $C(I) \wedge C(J) \Rightarrow C(I \cap J)$
- $R(I) \wedge R(J) \Rightarrow R(I \cap J)$

Full Disentanglement:
- $\bigwedge_{i=1}^n C(i) \Leftrightarrow \bigwedge_{i=1}^n D(i)$
- $\bigwedge_{i=1}^n R(i) \Leftrightarrow \bigwedge_{i=1}^n D(i)$

Our calculus provides a theoretically rigorous procedure for reasoning about disentanglement. In particular, it is no longer necessary to prove whether the supervision method of interest satisfies consistency and restrictiveness for each and every factor. Instead, it suffices to show that a supervision method guarantees consistency or restrictiveness for a subset of factors, and then to combine multiple supervision methods via the calculus to guarantee full disentanglement. We can additionally use the calculus to uncover consistency or restrictiveness on individual factors when weak supervision is available only for a subset of variables. For example, achieving consistency on $S_{\{1,2\}}$ and $S_{\{2,3\}}$ implies consistency on the intersection $S_2$. Furthermore, we note that these rules are agnostic to whether generator-based or encoder-based definitions are used. We defer the complete proofs to Appendix I.2. A small sketch of how these rules can be applied mechanically is given below.
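To illustrate how the calculus can be applied mechanically, the sketch below propagates known consistency/restrictiveness facts to a fixed point using the complement, union, and intersection rules. The representation of facts as frozensets, the six-factor universe, and all function names are our own illustrative choices, not an implementation from the paper.

```python
from itertools import combinations

ALL = frozenset(range(6))   # hypothetical: six factors of variation, as in Shapes3D

def closure(consistent, restricted, universe=ALL):
    """Propagate known C/R facts with the complement, union, and intersection rules."""
    C, R = set(map(frozenset, consistent)), set(map(frozenset, restricted))
    changed = True
    while changed:
        changed = False
        new_C = {universe - r for r in R}             # C(I) <=> R(\I)
        new_R = {universe - c for c in C}
        for a, b in combinations(C | new_C, 2):
            new_C |= {a | b, a & b}                   # union / intersection rules
        for a, b in combinations(R | new_R, 2):
            new_R |= {a | b, a & b}
        if not (new_C <= C and new_R <= R):
            C, R, changed = C | new_C, R | new_R, True
    return C, R

def fully_disentangled(consistent, restricted, universe=ALL):
    # Full-disentanglement rule: C(i) for every factor i (or R(i) for every i)
    # is equivalent to D(i) for every i.
    C, R = closure(consistent, restricted, universe)
    singles = {frozenset({i}) for i in universe}
    return singles <= C or singles <= R

# Share pairing on each factor separately guarantees C({i}) for every i,
# which by the calculus already implies full disentanglement.
print(fully_disentangled(consistent=[{i} for i in ALL], restricted=[]))   # True

# Change pairing on {0,1} and on {1,2} guarantees R({0,1}) and R({1,2});
# the intersection rule then yields single-factor restrictiveness R({1}).
C, R = closure(consistent=[], restricted=[{0, 1}, {1, 2}])
print(frozenset({1}) in R)                                                # True
```

The closure terminates because the universe of index sets is finite and the sets of known facts only grow.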
5 FORMALIZING WEAK SUPERVISION WITH GUARANTEES

In this section, we address the question of whether disentanglement arises from the supervision method or from model inductive bias. This challenge was first put forth by Locatello et al. (2019), who noted that unsupervised disentanglement is heavily reliant on model inductive bias. As we transition toward supervised approaches, it is crucial that we formalize what it means for disentanglement to be guaranteed by weak supervision.

Sufficiency for Disentanglement. Let $\mathcal{P}$ denote a family of augmented distributions. We say that a weak supervision method $\mathcal{S}: \mathcal{H} \to \mathcal{P}$ is sufficient for learning a generator whose latent codes $Z_I$ disentangle the factors $S_I$ if there exists a learning algorithm $\mathcal{A}: \mathcal{P} \to \mathcal{H}$ such that for any choice of $(p^*(s), g^*, e^*) \in \mathcal{H}$, the procedure $\mathcal{A} \circ \mathcal{S}(p^*(s), g^*, e^*)$ returns a model $(p(z), g, e)$ for which both $D(I; p, g, e^*)$ and $D(I; p^*, g^*, e)$ hold, and $g(Z) \overset{d}{=} g^*(S)$.

The key insight of this definition is that we force the strategy and learning algorithm pair $(\mathcal{S}, \mathcal{A})$ to handle all possible oracles drawn from the hypothesis class $\mathcal{H}$. This prevents the exploitation of model inductive bias, since any bias from the learning algorithm $\mathcal{A}$ toward a reduced hypothesis class $\hat{\mathcal{H}} \subset \mathcal{H}$ will result in failure to handle oracles in the complementary hypothesis class $\mathcal{H} \setminus \hat{\mathcal{H}}$. The distribution-matching requirement $g(Z) \overset{d}{=} g^*(S)$ ensures latent code informativeness, i.e., it prevents trivial solutions where the latent code is uninformative (see Proposition 6 for the formal statement). Intuitively, distribution matching paired with a deterministic generator guarantees invertibility of the learned generator and encoder, enforcing that $Z_I$ cannot encode less information than $S_I$ (e.g., only encoding age group instead of numerical age) and vice versa.

6 ANALYSIS OF WEAK SUPERVISION METHODS

We now apply our theoretical framework to three practical weak supervision methods: restricted labeling, match pairing, and rank pairing. Our main theoretical findings are that: (1) these methods can be applied in a targeted manner to provide single-factor consistency or restrictiveness guarantees; and (2) by enforcing consistency (or restrictiveness) on all factors, we can learn models with strong disentanglement performance. Correspondingly, Figure 3 and Figure 5 are our main experimental results, demonstrating that these theoretical guarantees have predictive power in practice.

6.1 THEORETICAL GUARANTEES FROM WEAK SUPERVISION

We prove that if a training algorithm successfully matches the generated distribution to the data distribution generated via restricted labeling, match pairing, or rank pairing of factors $S_I$, then $Z_I$ is guaranteed to be consistent with $S_I$:

Theorem 1. Given any oracle $(p^*(s), g^*, e^*) \in \mathcal{H}$, consider the distribution-matching algorithm $\mathcal{A}$ that selects a model $(p(z), g, e) \in \mathcal{H}$ such that:

1. $(g^*(S), S_I) \overset{d}{=} (g(Z), Z_I)$ (restricted labeling); or
2. $\big(g^*(S_I, S_{\setminus I}),\, g^*(S_I, S'_{\setminus I})\big) \overset{d}{=} \big(g(Z_I, Z_{\setminus I}),\, g(Z_I, Z'_{\setminus I})\big)$ (match pairing); or
3. $\big(g^*(S), g^*(S'), \mathbb{1}\{S_I \ge S'_I\}\big) \overset{d}{=} \big(g(Z), g(Z'), \mathbb{1}\{Z_I \ge Z'_I\}\big)$ (rank pairing).

Then $(p, g)$ satisfies $C(I; p, g, e^*)$ and $e$ satisfies $C(I; p^*, g^*, e)$.

Theorem 1 states that distribution matching under restricted labeling, match pairing, or rank pairing of $S_I$ guarantees both generator and encoder consistency for the learned generator and encoder, respectively. While the complement rule $C(I) \Leftrightarrow R(\setminus I)$ further guarantees that $Z_{\setminus I}$ is restricted to $S_{\setminus I}$, we can prove that the same supervision does not guarantee that $Z_I$ is restricted to $S_I$ (Theorem 2). However, if we additionally have restricted labeling or match pairing for $S_{\setminus I}$, then we can see from the calculus that $R(I) \wedge C(I)$ is guaranteed, thus implying disentanglement of factor $I$. We also note that while restricted labeling and match pairing can be applied to a set of factors at once (i.e., $|I| \ge 1$), rank pairing is restricted to one-dimensional factors for which an ordering exists. In the experiments below, we empirically verify the theoretical guarantees provided in Theorem 1. A sketch of one possible instantiation of the match-pairing objective is given below.
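The paper trains GANs to match these augmented distributions (and notes that any distribution-matching algorithm could be used). The snippet below is one possible minimal instantiation of the match-pairing objective, not the authors' released implementation: a discriminator scores concatenated pairs, where real pairs share the ground-truth factors in $I$ and generated pairs share the latent codes in $I$. The toy linear oracle, network sizes, and index set are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_factors, x_dim, I = 4, 16, [0, 1]          # hypothetical sizes; I = shared factor indices

# Toy oracle generator g* (stands in for the real data-generating process).
W_star = torch.randn(n_factors, x_dim)
def oracle_pairs(batch):
    s, s2 = torch.randn(batch, n_factors), torch.randn(batch, n_factors)
    s2[:, I] = s[:, I]                       # real pair shares factors S_I
    return s @ W_star, s2 @ W_star

# Learned generator g and pair discriminator D.
g = nn.Sequential(nn.Linear(n_factors, 64), nn.ReLU(), nn.Linear(64, x_dim))
d = nn.Sequential(nn.Linear(2 * x_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def generator_pairs(batch):
    z, z2 = torch.randn(batch, n_factors), torch.randn(batch, n_factors)
    z2[:, I] = z[:, I]                       # generated pair shares latent codes Z_I
    return g(z), g(z2)

for _ in range(200):                         # tiny illustrative training loop
    batch = 128
    x_real = torch.cat(oracle_pairs(batch), dim=1)
    x_fake = torch.cat(generator_pairs(batch), dim=1)

    # Discriminator: real pairs -> 1, generated pairs -> 0.
    loss_d = bce(d(x_real), torch.ones(batch, 1)) + \
             bce(d(x_fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the pair discriminator.
    loss_g = bce(d(torch.cat(generator_pairs(batch), dim=1)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The essential design choice is that the discriminator sees pairs rather than single observations, so the generator can only match the augmented distribution by reusing $Z_I$ across the pair in the same way the oracle reuses $S_I$.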
6.2 EXPERIMENTS

We conducted experiments on five prominent datasets in the disentanglement literature: Shapes3D (Kim & Mnih, 2018), dSprites (Higgins et al., 2017), Scream-dSprites (Locatello et al., 2019), SmallNORB (LeCun et al., 2004), and Cars3D (Reed et al., 2015). Since some of the underlying factors are treated as nuisance variables in SmallNORB and Scream-dSprites, we show in Appendix C that our theoretical framework can be easily adapted to handle such situations. We use generative adversarial networks (GANs; Goodfellow et al., 2014) for learning $(p, g)$, but any distribution-matching algorithm (e.g., maximum likelihood training in tractable models, or variational inference in latent-variable models) could be applied. Our results are collected over a broad range of hyperparameter configurations (see Appendix H for details). Since existing quantitative metrics of disentanglement all measure the performance of an encoder with respect to the true data generator, we trained an encoder post hoc to approximately invert the learned generator, and measured all quantitative metrics (e.g., mutual information gap) on the encoder. Our theory assumes that the learned generator is invertible. While this is not true for conventional GANs, our empirical results show that this is not an issue in practice (see Appendix G).

We present three sets of experimental results: (1) single-factor experiments, where we show that our theory can be applied in a targeted fashion to guarantee consistency or restrictiveness of a single factor; (2) consistency versus restrictiveness experiments, where we show the extent to which single-factor consistency and restrictiveness are correlated even when the models are only trained to maximize one or the other; and (3) full disentanglement experiments, where we apply our theory to fully disentangle all factors. A more extensive set of experiments can be found in the Appendix.

6.2.1 SINGLE-FACTOR CONSISTENCY AND RESTRICTIVENESS

We empirically verify that single-factor consistency or restrictiveness can be achieved with the supervision methods of interest.
Note that there are two special cases of match pairing: one where $S_i$ is the only factor that is shared between $x$ and $x'$, and one where $S_i$ is the only factor that is changed. We distinguish these two conditions as share pairing and change pairing, respectively. Theorem 1 shows that restricted labeling, share pairing, and rank pairing of the $i$-th factor are each sufficient supervision strategies for guaranteeing consistency on $S_i$. Change pairing at $S_i$ is equivalent to share pairing at $S_{\setminus i}$; the complement rule $C(I) \Leftrightarrow R(\setminus I)$ allows us to conclude that change pairing guarantees restrictiveness.

Figure 3: Heatmap visualization of ablation studies that measure either single-factor consistency or single-factor restrictiveness as a function of various supervision methods (restricted labeling, share pairing, change pairing, rank pairing, and change-pair intersection at factor i), conducted on Shapes3D. Our theory predicts the diagonal components to achieve the highest scores. Note that share pairing, change pairing, and change pair intersection are special cases of match pairing.

The first four heatmaps in Figure 3 show the results for restricted labeling, share pairing, change pairing, and rank pairing. The numbers shown in the heatmaps are the normalized consistency and restrictiveness scores. We define the normalized consistency score as

$$c(I; p^*, g^*, e) = 1 - \frac{\mathbb{E}_{p^*_I}\left\| e_I \circ g^*(s_I, s_{\setminus I}) - e_I \circ g^*(s_I, s'_{\setminus I}) \right\|^2}{\mathbb{E}_{s, s' \overset{\text{iid}}{\sim} p^*}\left\| e_I \circ g^*(s) - e_I \circ g^*(s') \right\|^2}. \qquad (12)$$

This score is bounded on the interval $[0, 1]$ (a consequence of Lemma 1) and is maximal when $C(I; p^*, g^*, e)$ is satisfied. This normalization procedure is similar in spirit to the Interventional Robustness score in Suter et al. (2018). The normalized restrictiveness score $r$ can be defined analogously. In practice, we estimate this score via Monte Carlo estimation; a sketch of such an estimator is given below.
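The following is a minimal sketch of such a Monte Carlo estimator for Equation (12), mirroring the generator-side sketch in Section 3. The toy linear oracle, the pseudo-inverse standing in for a learned encoder, and the factorized prior are illustrative assumptions, not the paper's evaluation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, x_dim = 6, 32
W = rng.normal(size=(n_factors, x_dim))

sample_s = lambda n: rng.normal(size=(n, n_factors))   # stand-in for p*(s)
g_star = lambda s: s @ W                                # stand-in oracle generator g*
encoder = lambda x: x @ np.linalg.pinv(W)               # stand-in learned encoder e

def normalized_consistency(I, n_samples=20000):
    """Monte Carlo estimate of the normalized consistency score c(I; p*, g*, e)."""
    I = np.asarray(I)
    not_I = np.setdiff1d(np.arange(n_factors), I)

    # Numerator: resample the complement factors while holding s_I fixed (share pairing).
    s = sample_s(n_samples)
    s_pair = s.copy()
    s_pair[:, not_I] = sample_s(n_samples)[:, not_I]    # fresh draws (factorized prior)
    num = np.mean(np.sum((encoder(g_star(s))[:, I] -
                          encoder(g_star(s_pair))[:, I]) ** 2, axis=1))

    # Denominator: fully independent samples as the normalization baseline.
    s1, s2 = sample_s(n_samples), sample_s(n_samples)
    den = np.mean(np.sum((encoder(g_star(s1))[:, I] -
                          encoder(g_star(s2))[:, I]) ** 2, axis=1))
    return 1.0 - num / den

print(normalized_consistency(I=[0]))   # close to 1 for this perfectly invertible toy model
```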
The final heatmap in Figure 3 demonstrates the calculus of intersection. In practice, it may be easier to acquire paired data where multiple factors change simultaneously. If we have access to two kinds of datasets, one where $S_I$ are changed and one where $S_J$ are changed, our calculus predicts that training on both datasets will guarantee restrictiveness on $S_{I \cap J}$. The final heatmap shows six such intersection settings and measures the normalized restrictiveness score; in all but one setting, the results are consistent with our theory. We show in Figure 7 that this inconsistency is attributable to the failure of the GAN to distribution-match due to sensitivity to a specific hyperparameter.

6.2.2 CONSISTENCY VERSUS RESTRICTIVENESS

Figure 4: Correlation plot and scatterplots demonstrating the empirical relationship between $c(i)$ and $r(i)$ across all 864 models trained on Shapes3D.

We now determine the extent to which consistency and restrictiveness are correlated in practice. In Figure 4, we collected all 864 Shapes3D models that we trained in Section 6.2.1 and measured the consistency and restrictiveness of each model on each factor, providing both the correlation plot and scatterplots of $c(i)$ versus $r(i)$. Since the models trained in Section 6.2.1 only ever targeted the consistency or restrictiveness of a single factor, and since our calculus shows that consistency and restrictiveness do not imply each other, one might a priori expect to find no correlation in Figure 4. Our results show that the correlation is actually quite strong. Since this correlation is not guaranteed by our choice of weak supervision, it is necessarily a consequence of model inductive bias. We believe this correlation between consistency and restrictiveness has been a general source of confusion in the disentanglement literature, causing many to either observe or believe that restricted labeling or share pairing on $S_i$ (which only guarantees consistency) is sufficient for disentangling $S_i$ (Kingma et al., 2014; Chen & Batmanghelich, 2019; Gabbay & Hoshen, 2019; Narayanaswamy et al., 2017). It remains an open question why consistency and restrictiveness are so strongly correlated when training existing models on real-world data.

6.2.3 FULL DISENTANGLEMENT

Figure 5: Disentanglement performance of a vanilla GAN, share-pairing GAN, change-pairing GAN, rank-pairing GAN, and fully-labeled GAN, as measured by the mutual information gap across several datasets (Shapes3D, dSprites, Scream-dSprites, SmallNORB, and Cars3D). A comprehensive set of performance evaluations on existing disentanglement metrics is available in Figure 13.
If we have access to share-, change-, or rank-pairing data for each factor, our calculus states that it is possible to guarantee full disentanglement. We trained our generative model on either complete share pairing, complete change pairing, or complete rank pairing, and measured disentanglement performance via the discretized mutual information gap (Chen et al., 2018a; Locatello et al., 2019). As negative and positive controls, we also show the performance of an unsupervised GAN and a fully-supervised GAN where the latents are fixed to the ground truth factors of variation. Our results in Figure 5 empirically verify that combining single-factor weak supervision datasets leads to consistently high disentanglement scores.

7 CONCLUSION

In this work, we construct a theoretical framework to rigorously analyze the disentanglement guarantees of weak supervision algorithms. Our paper clarifies several important concepts, such as consistency and restrictiveness, that have hitherto been confused or overlooked in the existing literature, and provides a formalism that precisely distinguishes when disentanglement arises from supervision versus model inductive bias. Through our theory and a comprehensive set of experiments, we demonstrated the conditions under which various supervision strategies guarantee disentanglement. Our work establishes several promising directions for future research. First, we hope that our formalism and experiments inspire greater theoretical and scientific scrutiny of the inductive biases present in existing models. Second, we encourage the search for other learning algorithms (besides distribution matching) that may have theoretical guarantees when paired with the right form of supervision. Finally, we hope that our framework enables the theoretical analysis of other promising weak supervision methods.

ACKNOWLEDGMENTS

We would like to thank James Brofos and Honglin Yuan for their insightful discussions on the theoretical analysis in this paper, and Aditya Grover and Hung H. Bui for their helpful feedback during the course of this project.

REFERENCES

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Junxiang Chen and Kayhan Batmanghelich. Weakly supervised disentanglement by pairwise similarities. arXiv preprint arXiv:1906.01044, 2019.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610-2620, 2018a.

Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, et al. Sample efficient adaptive text-to-speech. arXiv preprint arXiv:1809.10460, 2018b.

Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In ICLR, 2018.

Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, Narayanaswamy Siddharth, Brooks Paige, Dana H. Brooks, Jennifer Dy, and Jan-Willem van de Meent. Structured disentangled representations. arXiv preprint arXiv:1804.02086, 2018.

Aviv Gabbay and Yedid Hoshen. Latent optimization for non-adversarial representation disentanglement. arXiv preprint arXiv:1906.11796, 2019.
Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80-89. IEEE, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Luigi Gresele, Paul K. Rubenstein, Arash Mehrjou, Francesco Locatello, and Bernhard Schölkopf. The incomplete Rosetta Stone problem: Identifiability results for multi-view nonlinear ICA. arXiv preprint arXiv:1905.06642, 2019.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.

Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581-3589, 2014.

Jack Klys, Jake Snell, and Richard Zemel. Learning latent subspaces in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 6444-6454, 2018.

Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539-2547, 2015.

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.

Yann LeCun, Fu Jie Huang, Leon Bottou, et al. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pp. 97-104, 2004.

Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang, Yu-Ding Lu, Maneesh Singh, and Ming-Hsuan Yang. DRIT++: Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270, 2019.

Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, 2019.

Brian McFee and Gert R. Lanckriet. Metric learning to rank. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 775-782, 2010.

Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.

Siddharth Narayanaswamy, T. Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pp. 5925-5935, 2017.

Scott E. Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, pp. 1252-1260, 2015.

Karl Ridgeway and Michael C. Mozer. Learning deep disentangled embeddings with the F-statistic loss. In Advances in Neural Information Processing Systems, pp. 185-194, 2018.
Raphael Suter, Dorde Miladinovic, Stefan Bauer, and Bernhard Schölkopf. Interventional robustness of deep latent variable models. In ICML, 2018.

Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386-1393, 2014.

Our appendix consists of nine sections. We provide a brief summary of each section below.

Appendix A: We elaborate on the connections between existing definitions of disentanglement and our definitions of consistency / restrictiveness / disentanglement. In particular, we highlight three notable properties of our definitions not present in many existing definitions.

Appendix B: We evaluate our consistency and restrictiveness metrics on the 10800 models in disentanglement_lib, and identify models where consistency and restrictiveness are not correlated.

Appendix C: We adapt our definitions to handle nuisance variables. We do so through a simple modification of the definition of restrictiveness.

Appendix D: We show several additional single-factor experiments. We first address one of the results in the main text that is not consistent with our theory, and explain why it can be attributed to hyperparameter sensitivity. We then unwrap the heatmaps into more informative boxplots.

Appendix E: We provide an additional suite of consistency versus restrictiveness experiments by comparing the effects of training with share pairing (which guarantees consistency), change pairing (which guarantees restrictiveness), and both.

Appendix F: We provide full disentanglement results on all five datasets as measured according to six different metrics of disentanglement found in the literature.

Appendix G: We show visualizations of a weakly supervised generative model trained to achieve full disentanglement.

Appendix H: We describe the set of hyperparameter configurations used in all our experiments.

Appendix I: We provide the complete set of assumptions and proofs for our theoretical framework.

A CONNECTIONS TO EXISTING DEFINITIONS

Numerous definitions of disentanglement are present in the literature (Higgins et al., 2017; 2018; Kim & Mnih, 2018; Suter et al., 2018; Ridgeway & Mozer, 2018; Eastwood & Williams, 2018; Chen et al., 2018a). We mostly defer to the terminology suggested by Ridgeway & Mozer (2018), which decomposes disentanglement into modularity, compactness, and explicitness. Modularity means a latent code $Z_i$ is predictive of at most one factor of variation $S_j$. Compactness means a factor of variation $S_i$ is predicted by at most one latent code $Z_j$. And explicitness means a factor of variation $S_j$ is predicted by the latent codes via a simple transformation (e.g., linear). Similar to Eastwood & Williams (2018) and Higgins et al. (2018), we suggest a further decomposition of Ridgeway & Mozer (2018)'s explicitness into latent code informativeness and latent code simplicity. In this paper, we omit latent code simplicity from consideration. Since informativeness of the latent code is already enforced by our requirement that $g(Z)$ is equal in distribution to $g^*(S)$ (see Proposition 6), we focus on comparing our proposed concepts of consistency and restrictiveness to modularity and compactness.
We make note of three important distinctions.

Restrictiveness is not synonymous with either modularity or compactness. In Figure 2c, it is evident that the factor of variation size is not predictable from any individual $Z_i$ (conversely, $Z_1$ is not predictable from any individual factor $S_i$). As such, $Z_1$ is neither a modular nor a compact representation of size, despite being restricted to size. To our knowledge, no existing quantitative definition of disentanglement (or its decomposition) specifically measures restrictiveness.

Consistency and restrictiveness are invariant to statistically dependent factors of variation. Many existing definitions of disentanglement are instantiated by measuring the mutual information between $Z$ and $S$. For example, Ridgeway & Mozer (2018) define a latent code $Z_i$ to be ideally modular if it has high mutual information with a single factor $S_j$ and zero mutual information with all other factors $S_{\setminus j}$. This presents an issue when the true factors of variation are themselves statistically dependent; even if $Z_1 = S_1$, the latent code $Z_1$ would violate modularity if $S_1$ itself has positive mutual information with $S_2$. Consistency and restrictiveness circumvent this issue by relying on conditional resampling. Consistency, for example, only measures the extent to which $S_I$ is invariant to resampling of $Z_{\setminus I}$ when conditioned on $Z_I$, and is thus achieved as long as $s_I$ is a function of only $z_I$, irrespective of whether $s_I$ and $s_{\setminus I}$ are statistically dependent.
6b, we see that varying the factor along each row results in changes to object color but to no other attributes. However if we look across columns, we see that the representation of color changes depending on the setting of other factors, thus this factor is not consistent for object color. C HANDLING NUISANCE VARIABLES Our theoretical framework can handle nuisance variables, i.e., variables we cannot measure or perform weak supervision on. It may be impossible to label, or provide match-pairing on that factor of variation. For example, while many features of an image are measurable (such as brightness and Published as a conference paper at ICLR 2020 coloration), we may not be able to measure certain factors of variation or generate data pairs where these factors are kept constant. In this case, we can let one additional variable η act as nuisance variable that captures all additional sources of variation / stochasticity. Formally, suppose the full set of true factors is S {η} Rn+1. We define η-consistency Cη(I) = C(I) and η-restrictiveness Rη(I) = R(I {η}). This captures our intuition that, with nuisance variable, for consistency, we still want changes to Z\I {η} to not modify SI; for restrictiveness, we want changes to ZI {η} to only modify SI {η}. We define η-disentanglement as Dη(I) = Cη(I) Rη(I). All of our calculus still holds where we substitute Cη(I), Rη(I), Dη(I) for C(I), R(I), D(I); we prove one of the new full disentanglement rule as an illustration: Proposition 1. Vn i=1 Cη(i) Vn i=1 Dη(i). Proof. On the one hand, Vn i=1 Cη(i) Vn i=1 C(i) = C(1 : n) = R(η). On the other hand, Vn i=1 C(i) = Vn i=1 D(i) = Vn i=1 R(i). Therefore LHS = i [n], R(i) R(η) = Rη(i). The reverse direction is trivial. In (Locatello et al., 2019), the instance factor in Small NORB and the background image factor in Scream-d Sprites are treated as nuisance variables. By Proposition 1, as long as we perform weak supervision on all of the non-nuisance variables (via sharing-pairing, say) to guarantee their consistency with respect to the corresponding true factor of variation, we still have guaranteed full disentanglement despite the existence of nuisance variable and the fact that we cannot measure or perform weak supervision on nuisance variable. 
D SINGLE-FACTOR EXPERIMENTS

Figure 7: Heatmaps of single-factor normalized consistency / restrictiveness scores under restricted labeling, share pairing, change pairing, rank pairing, and change-pair intersection on Shapes3D. This is the same plot as Figure 3, but where we restrict our hyperparameter sweep to always set extra_dense = False. See Appendix H for details about the hyperparameter sweep.

Figure 8: Restricted labeling guarantees consistency. Each plot shows the normalized consistency score of each model for each factor of variation. Our theory predicts each boxplot highlighted in red to achieve the highest consistency. Due to the prevalence of restricted labeling in the existing literature, we chose to only conduct the single-factor restricted labeling experiment on Shapes3D.
Figure 9: Change pairing guarantees restrictiveness. Each plot shows the normalized restrictiveness score of each model for each factor of variation (row) across different datasets (columns). Different colors indicate models trained with change pairing on different factors. The appropriately-supervised model for each factor is marked in red.
Figure 10: Share pairing guarantees consistency. Each plot shows the normalized consistency score of each model for each factor of variation (row) across different datasets (columns). Different colors indicate models trained with share pairing on different factors. The appropriately-supervised model for each factor is marked in red.
[Figure 11 panels: normalized consistency score for each factor (y-axis, "Normalized Consistency Score for Factor k") plotted against the factor used for rank-pairing supervision (x-axis, "Rank Pairing at Factor i"), one panel per factor and dataset.]
Figure 11: Rank pairing guarantees consistency. Each plot shows the normalized consistency score of each model for each factor of variation (row) across different datasets (columns). Different colors indicate models trained with rank pairing on different factors. The appropriately-supervised model for each factor is marked in red.
E CONSISTENCY VERSUS RESTRICTIVENESS

[Figure 12 panels: scatterplots of normalized consistency score (y-axis) versus normalized restrictiveness score (x-axis) for each factor and dataset; legend: Change, Share, Both.]
Figure 12: Normalized consistency vs. restrictiveness score of different models on each factor (row) across different datasets (columns).
In many of the plots, we see that models trained via change pairing (blue) achieve higher restrictiveness; models trained via share pairing (orange) achieve higher consistency; and models trained with both techniques (green) simultaneously achieve restrictiveness and consistency in most cases.

F FULL DISENTANGLEMENT EXPERIMENTS

[Figure 13 panels: BetaVAE Score, FactorVAE Score, and Mutual Information Gap (rows) for each dataset (columns); x-axis "Supervision Method" with conditions None, Share, Change, Rank, Full-Label.]
Figure 13: Disentanglement performance of a vanilla GAN, share pairing GAN, change pairing GAN, rank pairing GAN, and fully-labeled GAN, as measured by multiple disentanglement metrics from the existing literature (rows) across multiple datasets (columns). According to almost all metrics, our weakly supervised models surpass the baseline and, in some cases, even outperform the fully-labeled model.

[Figure 14 panels: normalized consistency score at each factor (rows) for each dataset (columns); conditions None, Share, Change, Rank, Full-Label.]
Figure 14: Performance of a vanilla GAN (blue), share pairing GAN (orange), change pairing GAN (green), rank pairing GAN (red), and fully-labeled GAN (purple), as measured by the normalized consistency score of each factor (rows) across multiple datasets (columns). Factors {3, 4, 5} in the first column show that distribution matching to all six change / share pairing datasets is particularly challenging for the models when trained on certain hyperparameter choices.
However, since consistency and restrictiveness can be measured in weakly supervised settings, it suffices to use these metrics for hyperparameter selection. We see in Figure 16 and Appendix G that using consistency and restrictiveness for hyperparameter selection serves as a viable weakly-supervised surrogate for existing fully-supervised disentanglement metrics.

[Figure 15 panels: normalized restrictiveness score at each factor (rows) for each dataset (columns); conditions None, Share, Change, Rank, Full-Label.]
Figure 15: Performance of a vanilla GAN (blue), share pairing GAN (orange), change pairing GAN (green), rank pairing GAN (red), and fully-labeled GAN (purple), as measured by the normalized restrictiveness score of each factor (rows) across multiple datasets (columns). Since restrictiveness and consistency are complementary, we see that the anomalies in Figure 14 are reflected in the complementary factors in this figure.

[Figure 16 panels: one scatterplot per dataset ((a) Shapes3D, (b) dSprites, (c) Scream-dSprites, (d) Small NORB, ...), plotting BetaVAE Score, FactorVAE Score, and Mutual Information Gap against average normalized consistency and average normalized restrictiveness.]
Figure 16: Scatterplot of existing disentanglement metrics versus average normalized consistency and restrictiveness. Whereas existing disentanglement metrics are fully-supervised, it is possible to measure average normalized consistency and restrictiveness with weakly supervised data (share pairing and match pairing, respectively), making it viable to perform hyperparameter tuning under weakly supervised conditions.
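To make the weakly supervised model-selection recipe concrete, below is a minimal Monte Carlo sketch (ours, in numpy) of a consistency-style score for a single latent coordinate: fix z_i, resample the remaining latents, and measure how much the i-th encoder output moves, normalized by its spread over unpaired samples. The functions `g`, `e`, and `prior_sample`, and the naive variance normalizer, are illustrative assumptions; the exact normalized score reported in the figures follows the definition in the main text.

```python
import numpy as np

def consistency_score(g, e, i, prior_sample, n_outer=256, n_inner=2):
    """Monte Carlo estimate of a consistency-style score for latent coordinate i.

    g: generator, maps a latent vector z to an observation x.
    e: encoder, maps an observation x back to a latent vector.
    prior_sample(k): returns a writable (k, z_dim) numpy array drawn from p(z).
    If coordinate i is consistent, sharing z_i while resampling z_{\\i}
    should not move the encoder's i-th output.
    """
    diffs, spread = [], []
    for _ in range(n_outer):
        z = prior_sample(n_inner)              # (n_inner, z_dim)
        z[:, i] = z[0, i]                      # share coordinate i across the group
        ei = np.array([e(g(zk))[i] for zk in z])
        diffs.append(np.var(ei))               # within-group disagreement on factor i
        spread.append(np.var([e(g(zk))[i] for zk in prior_sample(n_inner)]))
    # 1 means perfectly consistent; 0 means as inconsistent as unpaired samples.
    return 1.0 - np.mean(diffs) / (np.mean(spread) + 1e-8)
```

Averaging such a score over all coordinates (together with an analogous restrictiveness-style score computed from change-paired samples) yields a selection criterion that requires no ground-truth factor labels.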
G FULL DISENTANGLEMENT VISUALIZATIONS

As a demonstration of the weakly-supervised generative models, we visualize our best-performing match-pairing generative models (as selected according to the normalized consistency score averaged across all the factors). Recall from Figures 2a to 2c that, to visually check for consistency and restrictiveness, it is important that we not only ablate a single factor (across the column), but also show that the factor stays consistent (down the row). Each block of 3 × 12 images in Figures 17 to 21 checks for disentanglement of the corresponding factor. Each row is constructed by randomly sampling Z\i and then ablating Zi.

Figure 17: Cars3D. Ground truth factors: elevation, azimuth, object type.
Figure 18: Shapes3D. Ground truth factors: floor color, wall color, object color, object size, object type, and azimuth.
Figure 19: dSprites. Ground truth factors: shape, scale, orientation, X-position, Y-position.
Figure 20: Scream-dSprites. Ground truth factors: shape, scale, orientation, X-position, Y-position.
Figure 21: Small NORB. Ground truth factors: category, elevation, azimuth, lighting condition.

H HYPERPARAMETERS

Table 1: We trained a probabilistic Gaussian encoder to approximately invert the generative model. The encoder is not trained jointly with the generator, but instead trained separately from the generative model (i.e., the encoder gradient does not backpropagate to the generative model). During training, the encoder is only exposed to data generated by the learned generative model.
4 × 4 spectral norm conv. 32. lReLU
4 × 4 spectral norm conv. 32. lReLU
4 × 4 spectral norm conv. 64. lReLU
4 × 4 spectral norm conv. 64. lReLU
flatten
128 spectral norm dense. lReLU
2 × z-dim spectral norm dense

Table 2: Generative model architecture.
128 dense. ReLU. batchnorm.
1024 dense. ReLU. batchnorm.
4 × 4 × 64 reshape.
4 × 4 conv. 64. lReLU. batchnorm.
4 × 4 conv. 32. lReLU. batchnorm.
4 × 4 conv. 32. lReLU. batchnorm.
4 × 4 conv. 3. sigmoid

Table 3: Discriminator used for restricted labeling. Parts in red are part of the hyperparameter search.
Discriminator Body
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
flatten
If extra dense: 128 × width spectral norm dense. lReLU
Discriminator Auxiliary Channel for Label
128 × width spectral norm dense. lReLU
If extra dense: 128 × width spectral norm dense. lReLU
Discriminator Head
concatenate body and auxiliary.
128 × width spectral norm dense. lReLU
128 × width spectral norm dense. lReLU
1 spectral norm dense with bias.

Table 4: Discriminator used for match pairing. We use a projection discriminator (Miyato & Koyama, 2018) and thus have an unconditional and a conditional head. Parts in red are part of the hyperparameter search.
Discriminator Body (applied separately to x and x′)
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
flatten
If extra dense: 128 × width spectral norm dense. lReLU
concatenate the pair.
128 × width spectral norm dense. lReLU
128 × width spectral norm dense. lReLU
Unconditional Head
1 spectral norm dense with bias
Conditional Head
128 × width spectral norm dense
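As a concrete reading of Table 3, here is a minimal PyTorch-style sketch (ours) of the restricted-labeling discriminator: a spectral-norm convolutional body for the image, an auxiliary dense channel for the observed labels s_I, and a joint head producing a single logit. The class name, strides, padding, and the 64 × 64 input size are illustrative assumptions; only the layer widths, the width multiplier, and the optional extra dense layer mirror the table.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(c_in, c_out):
    # 4x4 spectral-norm convolution; stride 2 and padding 1 are assumed defaults.
    return spectral_norm(nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1))

class RestrictedLabelingDiscriminator(nn.Module):
    """Sketch of the Table 3 discriminator: image body + label channel + joint head."""

    def __init__(self, img_channels=3, img_size=64, label_dim=2, width=1, extra_dense=False):
        super().__init__()
        w = width
        body = [
            sn_conv(img_channels, 32 * w), nn.LeakyReLU(0.2),
            sn_conv(32 * w, 32 * w), nn.LeakyReLU(0.2),
            sn_conv(32 * w, 64 * w), nn.LeakyReLU(0.2),
            sn_conv(64 * w, 64 * w), nn.LeakyReLU(0.2),
            nn.Flatten(),
        ]
        feat = 64 * w * (img_size // 16) ** 2
        if extra_dense:  # optional dense layer, part of the hyperparameter search
            body += [spectral_norm(nn.Linear(feat, 128 * w)), nn.LeakyReLU(0.2)]
            feat = 128 * w
        self.body = nn.Sequential(*body)

        aux = [spectral_norm(nn.Linear(label_dim, 128 * w)), nn.LeakyReLU(0.2)]
        if extra_dense:
            aux += [spectral_norm(nn.Linear(128 * w, 128 * w)), nn.LeakyReLU(0.2)]
        self.aux = nn.Sequential(*aux)

        self.head = nn.Sequential(
            spectral_norm(nn.Linear(feat + 128 * w, 128 * w)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(128 * w, 128 * w)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(128 * w, 1)),  # final logit, dense with bias
        )

    def forward(self, x, s_labeled):
        h = torch.cat([self.body(x), self.aux(s_labeled)], dim=1)
        return self.head(h)
```

The match-pairing and rank-pairing discriminators of Tables 4 and 5 reuse the same convolutional body, applied separately to each element of the pair, before their unconditional and conditional heads.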
Table 5: Discriminator used for rank pairing. For rank pairing, we use a special variant of the projection discriminator, where the conditional logit is computed by taking the difference between the two elements of the pair and multiplying by y ∈ {−1, +1}. The discriminator is thus implicitly taking on the role of an adversarially trained encoder that checks for violations of the ranking rule in the embedding space. Parts in red are part of the hyperparameter search.
Discriminator Body (applied separately to x and x′)
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
flatten
If extra dense: 128 × width spectral norm dense. lReLU
concatenate the pair.
Unconditional Head (applied separately to x and x′)
1 spectral norm dense with bias.
Conditional Head (applied separately to x and x′)
y-dim spectral norm dense.

For all models, we use the Adam optimizer with β1 = 0.5, β2 = 0.999 and set the generator learning rate to $1 \times 10^{-3}$. We use a batch size of 64 and set the leaky ReLU negative slope to 0.2. To demonstrate some degree of robustness to hyperparameter choices, we considered five different ablations:
1. Width multiplier on the discriminator network ({1, 2}).
2. Whether to add an extra fully-connected layer to the discriminator ({True, False}).
3. Whether to add a bias term to the head ({True, False}).
4. Whether to use a two-time-scale learning rate by setting the encoder + discriminator learning rate multiplier to ({1, 2}).
5. Whether to use the default PyTorch or Keras initialization scheme in all models.
As such, each of our experimental settings trains a total of 32 distinct models. The only exception is the intersection experiments, where we fixed the width multiplier to 1. To give a sense of the scale of our experimental setup, note that the 864 models in Figure 4 originate as follows:
1. 32 hyperparameter conditions × 6 restricted labeling conditions.
2. 32 hyperparameter conditions × 6 match pairing conditions.
3. 32 hyperparameter conditions × 6 share pairing conditions.
4. 32 hyperparameter conditions × 6 rank pairing conditions.
5. 16 hyperparameter conditions × 6 intersection conditions.
In total, this yields 4 × (32 × 6) + 16 × 6 = 768 + 96 = 864 models.

I.1 ASSUMPTIONS ON H

Assumption 1. Let $D \subseteq [n]$ index the discrete random variables $S_D$. Assume that the remaining random variables $S_C = S_{\setminus D}$ have a probability density function $p(s_C \mid s_D)$ for any set of values $s_D$ where $p(S_D = s_D) > 0$.

Assumption 2. Without loss of generality, suppose $S_{1:n} = [S_C, S_D]$ is ordered by concatenating the continuous variables with the discrete variables. Let $B(s_D) = [\mathrm{int}(\mathrm{supp}(p(s_C \mid s_D))), s_D]$ denote the interior of the support of the continuous conditional distribution of $S_C$, concatenated with its conditioning value $s_D$ drawn from $S_D$. With a slight abuse of notation, let $B(S) = \bigcup_{s_D : p(s_D) > 0} B(s_D)$. We assume $B(S)$ is zig-zag connected, i.e., for any $I, J \subseteq [n]$ and any two points $s_{1:n}, s'_{1:n} \in B(S)$ that only differ in coordinates in $I \cup J$, there exists a path $\{s^t_{1:n}\}_{t=0:T}$ contained in $B(S)$ such that
$s^0_{1:n} = s_{1:n}$, (13)
$s^T_{1:n} = s'_{1:n}$, (14)
$\forall\, 0 \le t < T$, either $s^t_{\setminus I} = s^{t+1}_{\setminus I}$ or $s^t_{\setminus J} = s^{t+1}_{\setminus J}$. (15)
Intuitively, this assumption allows transition from $s_{1:n}$ to $s'_{1:n}$ via a series of modifications that are only in $I$ or only in $J$.
Note that zig-zag connectedness is necessary for restrictiveness union (Proposition 3) and consistency intersection (Proposition 4). Figure 22 gives examples where restrictiveness union is not satisfied when zig-zag connectedness is violated.

Assumption 3. For any coordinate $j \in [m]$ of $g$ that maps to a continuous variable $X_j$, we assume that $g_j(s)$ is continuous at $s$ for all $s \in B(S)$; for any coordinate $j \in [m]$ of $g$ that maps to a discrete variable $X_j$ and any $s_D$ where $p(s_D) > 0$, we assume that $g_j(s)$ is constant over each connected component of $\mathrm{int}(\mathrm{supp}(p(s_C \mid s_D)))$. Define $B(X)$ analogously to $B(S)$. Symmetrically, for any coordinate $i \in [n]$ of $e$ that maps to a continuous variable $S_i$, we assume that $e_i(x)$ is continuous at $x$ for all $x \in B(X)$; for any coordinate $i \in [n]$ of $e$ that maps to a discrete $S_i$ and any $x_D$ where $p(x_D) > 0$, we assume that $e_i(x)$ is constant over each connected component of $\mathrm{int}(\mathrm{supp}(p(x_C \mid x_D)))$.

Assumption 4. Assume that every factor of variation is recoverable from the observation X. Formally, $(p, g, e)$ satisfies the following property:
$\mathbb{E}_{p(s_{1:n})} \lVert e \circ g(s_{1:n}) - s_{1:n} \rVert^2 = 0$. (16)

I.2 CALCULUS OF DISENTANGLEMENT

I.2.1 EXPECTED-NORM REDUCTION LEMMA

Lemma 1. Let $x, y$ be two random variables with joint distribution $p$, and let $f(x, y)$ be an arbitrary function. Then
$\mathbb{E}_{x \sim p(x)} \mathbb{E}_{y, y' \sim p(y \mid x)} \lVert f(x, y) - f(x, y') \rVert^2 \le \mathbb{E}_{(x,y), (x',y') \sim p(x,y)} \lVert f(x, y) - f(x', y') \rVert^2$.

Proof. Assume w.l.o.g. that $\mathbb{E}_{(x,y) \sim p(x,y)} f(x, y) = 0$. Then
LHS $= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{x \sim p(x)} \mathbb{E}_{y, y' \sim p(y \mid x)} f(x, y)^\top f(x, y')$ (17)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{x \sim p(x)} \big[\mathbb{E}_{y \sim p(y \mid x)} f(x, y)\big]^\top \big[\mathbb{E}_{y' \sim p(y \mid x)} f(x, y')\big]$ (18)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{x \sim p(x)} \lVert \mathbb{E}_{y \sim p(y \mid x)} f(x, y) \rVert^2$ (19)
$\le 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2$ (20)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\lVert \mathbb{E}_{(x,y) \sim p(x,y)} f(x, y) \rVert^2$ (21)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{(x,y), (x',y') \sim p(x,y)} f(x, y)^\top f(x', y')$ (22)
$=$ RHS. (23)

I.2.2 CONSISTENCY UNION

Let $L = I \cap J$, $K = \setminus(I \cup J)$, $M = I \setminus L$, $N = J \setminus L$.

Proposition 2. $C(I) \wedge C(J) \Rightarrow C(I \cup J)$.

Proof. $C(I)$ means that
$\mathbb{E}_{z_M, z_L} \mathbb{E}_{z_N, z'_N, z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z'_N, z'_K) \rVert^2 = 0$. (24)
For any fixed value of $z_M, z_L$,
$\mathbb{E}_{z_N, z'_N, z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z'_N, z'_K) \rVert^2$ (25)
$\ge \mathbb{E}_{z_N} \mathbb{E}_{z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z_N, z'_K) \rVert^2$, (26)
by plugging $x = z_N$, $y = z_K$ into Lemma 1. Therefore
$C(I) \Rightarrow \mathbb{E}_{z_M, z_L, z_N} \mathbb{E}_{z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z_N, z'_K) \rVert^2 = 0$. (27)
Similarly, we have
$C(J) \Rightarrow \mathbb{E}_{z_M, z_L, z_N} \mathbb{E}_{z_K, z'_K} \lVert r_J \circ G(z_M, z_L, z_N, z_K) - r_J \circ G(z_M, z_L, z_N, z'_K) \rVert^2 = 0$ (28)
$\Rightarrow \mathbb{E}_{z_M, z_L, z_N} \mathbb{E}_{z_K, z'_K} \lVert r_N \circ G(z_M, z_L, z_N, z_K) - r_N \circ G(z_M, z_L, z_N, z'_K) \rVert^2 = 0$. (29)
Since $I \cap N = \emptyset$ and $I \cup N = I \cup J$, adding (27) and (29) gives us $C(I \cup J)$.

I.2.3 RESTRICTIVENESS UNION

Figure 22: Zig-zag connectedness is necessary for restrictiveness union. Here $n = m = 3$. Colored areas indicate the support of $p(z_1, z_2)$; the marked numbers indicate the measurement of $s_3$ given $(z_1, z_2)$. The left two panels satisfy zig-zag connectedness (the paths are marked in gray) while the right two do not (indeed, $R(1) \wedge R(2) \not\Rightarrow R(\{1, 2\})$). In the right-most panel, any zig-zag path connecting two points from the blue and orange areas has to pass through the boundary of the support (disallowed).

Similarly define the index sets $L, K, M, N$.

Proposition 3. Under the assumptions specified in Appendix I.1, $R(I) \wedge R(J) \Rightarrow R(I \cup J)$.

Proof. Denote $f = e_K \circ g$. We claim that
$R(I) \iff \mathbb{E}_{z_{\setminus I}} \mathbb{E}_{z_I, z'_I} \lVert f(z_I, z_{\setminus I}) - f(z'_I, z_{\setminus I}) \rVert^2 = 0$ (30)
$\iff \forall (z_I, z_{\setminus I}), (z'_I, z_{\setminus I}) \in B(Z),\ f(z_I, z_{\setminus I}) = f(z'_I, z_{\setminus I})$. (31)
We first prove the backward direction. When we draw $z_{\setminus I} \sim p(z_{\setminus I})$ and $z_I, z'_I \sim p(z_I \mid z_{\setminus I})$, let $E_1$ denote the event that $(z_I, z_{\setminus I}) \notin B(Z)$, and $E_2$ the event that $(z'_I, z_{\setminus I}) \notin B(Z)$. Reorder the indices of $(z_I, z_{\setminus I})$ as $(z_C, z_D)$. The probability that $(z_I, z_{\setminus I}) \notin B(Z)$ (i.e., that $z_C$ lies on the boundary of $B(z_D)$) is 0. Therefore $\Pr[E_1] = \Pr[E_2] = 0$, and so $\Pr[E_1 \cup E_2] \le \Pr[E_1] + \Pr[E_2] = 0$; i.e., with probability 1, $\lVert f(z_I, z_{\setminus I}) - f(z'_I, z_{\setminus I}) \rVert^2 = 0$.

Now we prove the forward direction. Assume for the sake of contradiction that there exist $(z_I, z_{\setminus I}), (z'_I, z_{\setminus I}) \in B(Z)$ such that $f(z_I, z_{\setminus I}) < f(z'_I, z_{\setminus I})$. Denote $U = I \cap D$, $V = I \cap C$, $W = \setminus I \cap D$, $Q = \setminus I \cap C$. We have $f(z_U, z_V, z_W, z_Q) < f(z'_U, z'_V, z_W, z_Q)$. Since $f$ is continuous (or constant) at $(z_U, z_V, z_W, z_Q)$ in the interior of $B([z_U, z_W])$, and $f$ is also continuous (or constant) at $(z'_U, z'_V, z_W, z_Q)$ in the interior of $B([z'_U, z_W])$, we can draw open balls of radius $r > 0$ around each point, i.e., $B_r(z_V, z_Q) \subseteq B([z_U, z_W])$ and $B_r(z'_V, z_Q) \subseteq B([z'_U, z_W])$, where for all $(\hat z_V, \hat z_Q) \in B_r(z_V, z_Q)$ and $(\hat z'_V, \hat z'_Q) \in B_r(z'_V, z_Q)$,
$f(z_U, \hat z_V, z_W, \hat z_Q) < f(z'_U, \hat z'_V, z_W, \hat z'_Q)$. (32)
When we draw $z_{\setminus I} \sim p(z_{\setminus I})$ and $z_I, z'_I \sim p(z_I \mid z_{\setminus I})$, let $C$ denote the event that $(z_I, z_{\setminus I}) = (\hat z_V, z_U, z^{\#}_Q, z_W)$ and $(z'_I, z_{\setminus I}) = (\hat z'_V, z'_U, z^{\#}_Q, z_W)$, where $(\hat z_V, z^{\#}_Q) \in B_r(z_V, z_Q)$ and $(\hat z'_V, z^{\#}_Q) \in B_r(z'_V, z_Q)$. Since both balls have positive volume, $\Pr[C] > 0$. However, $\lVert f(z_I, z_{\setminus I}) - f(z'_I, z_{\setminus I}) \rVert^2 > 0$ whenever event $C$ happens, which contradicts $R(I)$. Therefore, $\forall (z_I, z_{\setminus I}), (z'_I, z_{\setminus I}) \in B(Z)$, $f(z_I, z_{\setminus I}) = f(z'_I, z_{\setminus I})$.

We have shown that
$R(I) \iff \forall (z_M, z_L, z_N, z_K), (z'_M, z'_L, z_N, z_K) \in B(Z),\ f(z_M, z_L, z_N, z_K) = f(z'_M, z'_L, z_N, z_K)$, (33)
$R(J) \iff \forall (z_M, z_L, z_N, z_K), (z_M, z'_L, z'_N, z_K) \in B(Z),\ f(z_M, z_L, z_N, z_K) = f(z_M, z'_L, z'_N, z_K)$, (34)
$R(I \cup J) \iff \forall (z_M, z_L, z_N, z_K), (z'_M, z'_L, z'_N, z_K) \in B(Z),\ f(z_M, z_L, z_N, z_K) = f(z'_M, z'_L, z'_N, z_K)$. (35)
Let the zig-zag path between $(z_M, z_L, z_N, z_K)$ and $(z'_M, z'_L, z'_N, z_K) \in B(Z)$ be $\{(z^t_M, z^t_L, z^t_N, z_K)\}_{t=0:T}$. Repeatedly applying the equivalent conditions of $R(I)$ and $R(J)$ gives us
$f(z_M, z_L, z_N, z_K) = f(z^1_M, z^1_L, z^1_N, z_K) = \cdots = f(z^{T-1}_M, z^{T-1}_L, z^{T-1}_N, z_K) = f(z'_M, z'_L, z'_N, z_K)$. (36)

I.3 CONSISTENCY AND RESTRICTIVENESS INTERSECTION

Proposition 4. Under the same assumptions as restrictiveness union, $C(I) \wedge C(J) \Rightarrow C(I \cap J)$.

Proof.
$C(I) \wedge C(J) \iff R(\setminus I) \wedge R(\setminus J)$ (37)
$\Rightarrow R(\setminus I \cup \setminus J)$ (38)
$\iff C(\setminus(\setminus I \cup \setminus J))$ (39)
$= C(I \cap J)$. (40)

Proposition 5. $R(I) \wedge R(J) \Rightarrow R(I \cap J)$. The proof is analogous to that of Proposition 4.

I.4 DISTRIBUTION MATCHING GUARANTEES LATENT CODE INFORMATIVENESS

Proposition 6. If $(p^*, g^*, e^*) \in H$, $(p, g, e) \in H$, and $g^*(S) \overset{d}{=} g(Z)$, then there exists a continuous function $r$ such that
$\mathbb{E}_{p^*(s_{1:n})} \lVert r \circ e \circ g^*(s) - s \rVert = 0$. (41)

Proof. We show that $r = e^* \circ g$ satisfies Proposition 6. By Assumption 4,
$\mathbb{E}_s \lVert e^* \circ g^*(s) - s \rVert^2 = 0$, (42)
$\mathbb{E}_z \lVert e \circ g(z) - z \rVert^2 = 0$. (43)
By the same reasoning as in the proof of Proposition 3,
$\mathbb{E}_s \lVert e^* \circ g^*(s) - s \rVert^2 = 0 \Rightarrow \forall s \in B(S),\ e^* \circ g^*(s) = s$, (44)
$\mathbb{E}_z \lVert e \circ g(z) - z \rVert^2 = 0 \Rightarrow \forall z \in B(Z),\ e \circ g(z) = z$. (45)
Let $s \sim p^*(s)$. We claim that $\Pr[E_1] = 1$, where $E_1$ denotes the event that there exists $z \in B(Z)$ such that $g^*(s) = g(z)$. Suppose to the contrary that there is a measure-non-zero set $\bar S \subseteq \mathrm{supp}(p^*(s))$ such that for all $s \in \bar S$, no $z \in B(Z)$ satisfies $g^*(s) = g(z)$. Let $\bar X = \{g^*(s) : s \in \bar S\}$. As $g^*(S) \overset{d}{=} g(Z)$, $\Pr_s[g^*(s) \in \bar X] = \Pr_z[g(z) \in \bar X] > 0$. Therefore, there exists $\bar Z \subseteq \mathrm{supp}(p(z)) \setminus B(Z)$ such that $\bar X \subseteq \{g(z) : z \in \bar Z\}$. But $\mathrm{supp}(p(z)) \setminus B(Z)$ has measure 0. Contradiction. When we draw $s$, let $E_2$ denote the event that $s \in B(S)$. $\Pr[E_2] = 1$, so $\Pr[E_1 \cap E_2] = 1$.
When $E_1 \cap E_2$ happens, $e^* \circ g \circ e \circ g^*(s) = e^* \circ g \circ e \circ g(z) = e^* \circ g(z) = e^* \circ g^*(s) = s$. Therefore,
$\mathbb{E}_s \lVert e^* \circ g \circ e \circ g^*(s) - s \rVert = 0$. (46)

I.5 WEAK SUPERVISION GUARANTEE

Theorem 1. Given any oracle $(p^*(s), g^*, e^*) \in H$, consider the distribution-matching algorithm $A$ that selects a model $(p(z), g, e) \in H$ such that:
1. $(g^*(S), S_I) \overset{d}{=} (g(Z), Z_I)$ (Restricted Labeling); or
2. $\big(g^*(S_I, S_{\setminus I}), g^*(S_I, S'_{\setminus I})\big) \overset{d}{=} \big(g(Z_I, Z_{\setminus I}), g(Z_I, Z'_{\setminus I})\big)$ (Match Pairing); or
3. $\big(g^*(S), g^*(S'), \mathbb{1}\{S_I \le S'_I\}\big) \overset{d}{=} \big(g(Z), g(Z'), \mathbb{1}\{Z_I \le Z'_I\}\big)$ (Rank Pairing).
Then $(p, g)$ satisfies $C(I; p, g, e^*)$ and $e$ satisfies $C(I; p^*, g^*, e)$.

Proof. We prove the three cases separately.
1. Since $(x_d, s_I) \overset{d}{=} (x_g, z_I)$, consider the measurable function
$f(a, b) = \lVert e^*_I(a) - b \rVert^2$. (47)
Then
$\mathbb{E} \lVert e^*_I(x_d) - s_I \rVert^2 = \mathbb{E} \lVert e^*_I(x_g) - z_I \rVert^2 = 0$. (48)
By the same reasoning as in the proof of Proposition 3,
$\mathbb{E}_z \lVert e^*_I \circ g(z) - z_I \rVert^2 = 0 \Rightarrow \forall z \in B(Z),\ e^*_I \circ g(z) = z_I$ (49)
$\Rightarrow \mathbb{E}_{z_I} \mathbb{E}_{z_{\setminus I}, z'_{\setminus I}} \lVert e^*_I \circ g(z_I, z_{\setminus I}) - e^*_I \circ g(z_I, z'_{\setminus I}) \rVert^2 = 0$, (50)
i.e., $g$ satisfies $C(I; p, g, e^*)$. By symmetry, $e$ satisfies $C(I; p^*, g^*, e)$.
2. We have
$\big(g^*(S_I, S_{\setminus I}), g^*(S_I, S'_{\setminus I})\big) \overset{d}{=} \big(g(Z_I, Z_{\setminus I}), g(Z_I, Z'_{\setminus I})\big)$ (51)
$\Rightarrow \lVert e^*_I \circ g^*(S_I, S_{\setminus I}) - e^*_I \circ g^*(S_I, S'_{\setminus I}) \rVert^2 \overset{d}{=} \lVert e^*_I \circ g(Z_I, Z_{\setminus I}) - e^*_I \circ g(Z_I, Z'_{\setminus I}) \rVert^2$ (52)
$\Rightarrow \mathbb{E}_{z_I} \mathbb{E}_{z_{\setminus I}, z'_{\setminus I}} \lVert e^*_I \circ g(z_I, z_{\setminus I}) - e^*_I \circ g(z_I, z'_{\setminus I}) \rVert^2 = 0$. (53)
So $g$ satisfies $C(I; p, g, e^*)$. By symmetry, $e$ satisfies $C(I; p^*, g^*, e)$.
3. Let $I = \{i\}$ and $f = e^*_I \circ g$. Distribution matching implies that, with probability 1 over random draws of $Z, Z'$, the following event $P$ happens:
$Z_I \le Z'_I \Rightarrow f(Z) \le f(Z')$, (54)
i.e.,
$\mathbb{E}_{z, z'}\, \mathbb{1}[\neg P] = 0$. (55)
Let $W = \setminus I \cap D$, $Q = \setminus I \setminus D$. We showed in the proof of Proposition 3 that
$C(I) \iff \forall (z_I, z_W, z_Q), (z_I, z'_W, z'_Q) \in B(Z),\ f(z_I, z_W, z_Q) = f(z_I, z'_W, z'_Q)$. (56)
We prove by contradiction. Suppose there exist $(z_I, z_W, z_Q), (z_I, z'_W, z'_Q) \in B(Z)$ such that $f(z_I, z_W, z_Q) < f(z_I, z'_W, z'_Q)$.
(a) Case 1: $Z_I$ is discrete. Since $f$ is constant both at $(z_I, z_W, z_Q)$ in the interior of $B([z_I, z_W])$ and at $(z_I, z'_W, z'_Q)$ in the interior of $B([z_I, z'_W])$, we can draw open balls of radius $r > 0$ around each point, i.e., $B_r(z_Q) \subseteq B([z_I, z_W])$ and $B_r(z'_Q) \subseteq B([z_I, z'_W])$, where for all $\hat z_Q \in B_r(z_Q)$ and $\hat z'_Q \in B_r(z'_Q)$,
$f(z_I, z_W, \hat z_Q) < f(z_I, z'_W, \hat z'_Q)$. (57)
When we draw $z, z' \sim p(z)$, let $C$ denote the event that this specific value of $z_I$ is picked for both $z$ and $z'$, that $z_{\setminus I}$ falls in $B_r(z'_Q)$ with discrete part $z'_W$, and that $z'_{\setminus I}$ falls in $B_r(z_Q)$ with discrete part $z_W$. Since both balls have positive volume, $\Pr[C] > 0$. However, $P$ does not happen whenever event $C$ happens, since $z_I = z'_I$ but $f(z) > f(z')$, which contradicts $\Pr[P] = 1$.
(b) Case 2: $Z_I$ is continuous. Similar to Case 1, we can draw open balls of radius $r > 0$ around each point, i.e., $B_r(z_I, z_Q) \subseteq B(z_W)$ and $B_r(z_I, z'_Q) \subseteq B(z'_W)$, where for all $(\hat z_I, \hat z_Q) \in B_r(z_I, z_Q)$ and $(\hat z'_I, \hat z'_Q) \in B_r(z_I, z'_Q)$,
$f(\hat z_I, z_W, \hat z_Q) < f(\hat z'_I, z'_W, \hat z'_Q)$. (58)
Let $H_1 = \{(\hat z_I, \hat z_Q) \in B_r(z_I, z_Q) : \hat z_I \ge z_I\}$ and $H_2 = \{(\hat z'_I, \hat z'_Q) \in B_r(z_I, z'_Q) : \hat z'_I \le z_I\}$. When we draw $z, z' \sim p(z)$, let $C$ denote the event that we pick $z \in H_2 \times \{z'_W\}$ and $z' \in H_1 \times \{z_W\}$. Since $H_1, H_2$ have positive volume, $\Pr[C] > 0$. However, $P$ does not happen whenever event $C$ happens, since $z_I \le z'_I$ but $f(z) > f(z')$, which contradicts $\Pr[P] = 1$.
Therefore, we have shown that
$\forall (z_I, z_W, z_Q), (z_I, z'_W, z'_Q) \in B(Z),\ f(z_I, z_W, z_Q) = f(z_I, z'_W, z'_Q)$, (59)
i.e., $g$ satisfies $C(I; p, g, e^*)$. By symmetry, $e$ satisfies $C(I; p^*, g^*, e)$.

I.6 WEAK SUPERVISION IMPOSSIBILITY RESULT

Theorem 2. Weak supervision via restricted labeling, match pairing, or ranking on $s_I$ is not sufficient for learning a generative model whose latent code $Z_I$ is restricted to $S_I$.

Proof. We construct the following counterexample. Let $n = m = 3$ and $I = \{1\}$.
The data-generating process is $s_1 \sim \mathrm{unif}([0, 2\pi))$, $(s_2, s_3) \sim \mathrm{unif}(\{(x, y) : x^2 + y^2 \le 1\})$, with $g^*(s) = [s_1, s_2, s_3]$. Consider a generator with $z_1 \sim \mathrm{unif}([0, 2\pi))$, $(z_2, z_3) \sim \mathrm{unif}(\{(x, y) : x^2 + y^2 \le 1\})$, and $g(z) = [z_1,\ \cos(z_1) z_2 - \sin(z_1) z_3,\ \sin(z_1) z_2 + \cos(z_1) z_3]$. Then $(x_d, s_I) \overset{d}{=} (x_g, z_I)$, but neither $R(I; p, g, e^*)$ nor $R(I; p^*, g^*, e)$ holds. The same counterexample is applicable for match pairing and rank pairing.
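To make the counterexample tangible, the following numerical sketch (ours, in numpy; the distribution check is a crude moment comparison rather than a formal test) samples from both processes, confirms that the observed distributions agree (the restricted label equals the first coordinate of x in both models, so matching the x distribution matches the joint), and then shows that intervening on z_1 alone moves the second and third output coordinates of g, so z_1 is not restricted to s_1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def sample_disk(n):
    # Uniform samples on the unit disk via rejection sampling from the square.
    pts = rng.uniform(-1, 1, size=(2 * n, 2))
    pts = pts[(pts ** 2).sum(axis=1) <= 1]
    return pts[:n]

def g_star(s1, s23):
    # Oracle generator: identity map on the factors.
    return np.column_stack([s1, s23])

def g(z1, z23):
    # Alternative generator: rotate (z2, z3) by the angle z1.
    c, s = np.cos(z1), np.sin(z1)
    return np.column_stack([z1, c * z23[:, 0] - s * z23[:, 1],
                                s * z23[:, 0] + c * z23[:, 1]])

s1 = rng.uniform(0, 2 * np.pi, N)
z1 = rng.uniform(0, 2 * np.pi, N)
x_d = g_star(s1, sample_disk(N))
x_g = g(z1, sample_disk(N))

# Restricted labeling on factor 1: (x_d, s_1) and (x_g, z_1) agree in distribution,
# because rotating the rotationally symmetric disk distribution leaves it unchanged.
# Crude check: first and second moments match up to Monte Carlo error.
print(np.abs(x_d.mean(0) - x_g.mean(0)).max())       # small
print(np.abs(np.cov(x_d.T) - np.cov(x_g.T)).max())   # small

# But z_1 is not restricted to s_1: changing z_1 with (z_2, z_3) held fixed
# also changes coordinates 2 and 3 of the output, so R({1}) fails for g.
z23 = np.array([[0.5, 0.25]])
print(g(np.array([0.0]), z23)[0, 1:])         # [0.5, 0.25]
print(g(np.array([np.pi / 2]), z23)[0, 1:])   # approximately [-0.25, 0.5]
```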