# WEAKLY SUPERVISED DISENTANGLEMENT WITH GUARANTEES

Published as a conference paper at ICLR 2020

Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon & Ben Poole
Stanford University; Google Brain
{ruishu,cynnjjs,ermon}@stanford.edu, {abhishk,pooleb}@google.com
(Work done during an internship at Google Brain.)

Learning disentangled representations that correspond to factors of variation in real-world data is critical to interpretable and human-controllable machine learning. Recently, concerns about the viability of learning disentangled representations in a purely unsupervised manner have spurred a shift toward the incorporation of weak supervision. However, there is currently no formalism that identifies when and how weak supervision will guarantee disentanglement. To address this issue, we provide a theoretical framework to assist in analyzing the disentanglement guarantees (or lack thereof) conferred by weak supervision when coupled with learning algorithms based on distribution matching. We empirically verify the guarantees and limitations of several weak supervision methods (restricted labeling, match pairing, and rank pairing), demonstrating the predictive power and usefulness of our theoretical framework.

1 INTRODUCTION

Many real-world datasets can be intuitively described via a data-generating process that first samples an underlying set of interpretable factors, and then, conditional on those factors, generates an observed data point. For example, in image generation, one might first sample the object identity and pose, and then render an image with the object in the correct pose. The goal of disentangled representation learning is to learn a representation where each dimension of the representation corresponds to a distinct factor of variation in the dataset (Bengio et al., 2013). Learning such representations that align with the underlying factors of variation may be critical to the development of machine learning models that are explainable or human-controllable (Gilpin et al., 2018; Lee et al., 2019; Klys et al., 2018).

In recent years, disentanglement research has focused on learning such representations in an unsupervised fashion, using only independent samples from the data distribution without access to the true factors of variation (Higgins et al., 2017; Chen et al., 2018a; Kim & Mnih, 2018; Esmaeili et al., 2018). However, Locatello et al. (2019) demonstrated that many existing methods for the unsupervised learning of disentangled representations are brittle, requiring careful supervision-based hyperparameter tuning. To build robust disentangled representation learning methods that do not require large amounts of supervised data, recent work has turned to forms of weak supervision (Chen & Batmanghelich, 2019; Gabbay & Hoshen, 2019). Weak supervision can allow one to build models that have interpretable representations even when human labeling is challenging (e.g., hair style in face generation, or style in music generation). While existing methods based on weakly supervised learning demonstrate empirical gains, there is no existing formalism for describing the theoretical guarantees conferred by different forms of weak supervision (Kulkarni et al., 2015; Reed et al., 2015; Bouchacourt et al., 2018). In this paper, we present a comprehensive theoretical framework for weakly supervised disentanglement, and evaluate our framework on several datasets. Our contributions are several-fold.
1. We formalize weakly-supervised learning as distribution matching in an extended space.
2. We propose a set of definitions for disentanglement that can handle correlated factors and are inspired by many existing definitions in the literature (Higgins et al., 2018; Suter et al., 2018; Ridgeway & Mozer, 2018).
3. Using these definitions, we provide a conceptually useful and theoretically rigorous calculus of disentanglement.
4. We apply our theoretical framework of disentanglement to analyze three notable classes of weak supervision methods (restricted labeling, match pairing, and rank pairing). We show that although certain weak supervision methods (e.g., style-labeling in style-content disentanglement) do not guarantee disentanglement, our calculus can determine whether disentanglement is guaranteed when multiple sources of weak supervision are combined.
5. Finally, we perform extensive experiments to systematically and empirically verify our predicted guarantees. (Code is available at https://github.com/google-research/google-research/tree/master/weak_disentangle.)

2 FROM UNSUPERVISED TO WEAKLY SUPERVISED DISTRIBUTION MATCHING

Our goal in disentangled representation learning is to identify a latent-variable generative model whose latent variables correspond to ground truth factors of variation in the data. To identify the role that weak supervision plays in providing guarantees on disentanglement, we first formalize the model families we are considering, the forms of weak supervision, and finally the metrics we will use to evaluate and prove components of disentanglement.

We consider data-generating processes where $S \in \mathbb{R}^n$ are the factors of variation, with distribution $p^*(s)$, and $X \in \mathbb{R}^m$ is the observed data point, which is a deterministic function of $S$, i.e., $X = g^*(S)$. Many existing algorithms for unsupervised learning of disentangled representations aim to learn a latent-variable model with prior $p(z)$ and generator $g$ such that $g(Z) \overset{d}{=} g^*(S)$. However, simply matching the marginal distribution over data is not enough: the learned latent variables $Z$ and the true generating factors $S$ could still be entangled with each other (Locatello et al., 2019). To address the failures of unsupervised learning of disentangled representations, we leverage weak supervision, where information about the data-generating process is conveyed through additional observations. By performing distribution matching on an augmented space (instead of just on the observation $X$), we can provide guarantees on learned representations.

Figure 1: Augmented data distributions derived from weak supervision (panels: restricted labeling, match pairing, rank pairing, unsupervised). Shaded nodes denote observed quantities, and unshaded nodes represent unobserved (latent) variables.

We consider three practical forms of weak supervision: restricted labeling, match pairing, and rank pairing. All of these forms of supervision can be thought of as augmented forms of the original joint distribution, where we partition the latent variables into two groups, $S = (S_I, S_{\setminus I})$, and either observe a subset of the latent variables or share latents between multiple samples. A visualization of these augmented distributions is presented in Figure 1, and below we detail each form of weak supervision.

In restricted labeling, we observe a subset of the ground-truth factors $S_I$ in addition to $X$. This allows us to perform distribution matching on $p^*(s_I, x)$, the joint distribution over data and observed factors, instead of just the data, $p^*(x)$, as in unsupervised learning.
This form of supervision is often leveraged in style-content disentanglement, where labels are available for content but not style (Kingma et al., 2014; Narayanaswamy et al., 2017; Chen et al., 2018b; Gabbay & Hoshen, 2019).

Match pairing uses paired data $(x, x')$ that share values for a known subset of factors $I$. For many data modalities, factors of variation may be difficult to label explicitly. Instead, it may be easier to collect pairs of samples that share the same underlying factor (e.g., collecting pairs of images of different people wearing the same glasses is easier than defining labels for style of glasses). Match pairing is a weaker form of supervision than restricted labeling, since the learning algorithm no longer depends on the underlying value $s_I$, only on the indices $I$ of the shared factors. Several variants of match pairing have appeared in the literature (Kulkarni et al., 2015; Bouchacourt et al., 2018; Ridgeway & Mozer, 2018), but they typically focus on groups of observations, in contrast to the paired setting we consider in this paper.

Rank pairing is another form of paired data generation where the pairs $(x, x')$ are generated in an i.i.d. fashion, and an additional indicator variable $y$ is observed that indicates whether the corresponding latent $s_i$ is greater than $s'_i$: $y = \mathbb{1}\{s_i \ge s'_i\}$. Such a form of supervision is effective when it is easier to compare two samples with respect to an underlying factor than to directly collect labels (e.g., comparing two object sizes versus providing a ruler measurement of an object). Although supervision via ranking features prominently in the metric learning literature (McFee & Lanckriet, 2010; Wang et al., 2014), our focus in this paper is on rank pairing in the context of disentanglement guarantees.

For each form of weak supervision, we can train generative models with the same structure as in Figure 1, using data sampled from the ground truth model and a distribution-matching objective. For example, for match pairing, we train a generative model $(p(z), g)$ such that the paired random variable $(g(Z_I, Z_{\setminus I}),\, g(Z_I, Z'_{\setminus I}))$ from the generator matches the distribution of the corresponding paired random variable $(g^*(S_I, S_{\setminus I}),\, g^*(S_I, S'_{\setminus I}))$ from the augmented data distribution. A sketch of how these three augmented data distributions can be sampled from a ground-truth process is given below.
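To make the three augmented distributions concrete, the following sketch samples restricted-labeling, match-pairing, and rank-pairing data from a toy ground-truth process. The factor dimensionality, the random linear map standing in for $g^*$, and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors = 3
W = rng.normal(size=(n_factors, 8))      # toy "pixels" for the oracle generator

def sample_factors(batch):
    # Toy prior p*(s); the true prior need not be factorized.
    return rng.normal(size=(batch, n_factors))

def g_star(s):
    # Toy deterministic oracle generator g*.
    return s @ W

def restricted_labeling(batch, I):
    # Observe x together with the labeled subset of factors s_I.
    s = sample_factors(batch)
    return g_star(s), s[:, I]

def match_pairing(batch, I):
    # Two samples share the factors indexed by I; all other factors are redrawn.
    s, s2 = sample_factors(batch), sample_factors(batch)
    s2[:, I] = s[:, I]
    return g_star(s), g_star(s2)

def rank_pairing(batch, i):
    # Two i.i.d. samples plus a binary comparison of factor i.
    s, s2 = sample_factors(batch), sample_factors(batch)
    y = (s[:, i] >= s2[:, i]).astype(np.float32)
    return g_star(s), g_star(s2), y

x, s_I = restricted_labeling(4, I=[0])
x1, x2 = match_pairing(4, I=[0, 2])
x1, x2, y = rank_pairing(4, i=1)
```

Note that restricted labeling exposes the factor values $s_I$ themselves, while match pairing exposes only the index set $I$ and rank pairing exposes only a binary comparison.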
3 DEFINING DISENTANGLEMENT

To identify the role that weak supervision plays in providing guarantees on disentanglement, we introduce a set of definitions that are consistent with our intuitions about what constitutes disentanglement and amenable to theoretical analysis. Our new definitions decompose disentanglement into two distinct concepts: consistency and restrictiveness. Different forms of weak supervision can enable consistency or restrictiveness on subsets of factors, and in Section 4 we build up a calculus of disentanglement from these primitives. We discuss the relationship to prior definitions of disentanglement in Appendix A.

3.1 DECOMPOSING DISENTANGLEMENT INTO CONSISTENCY AND RESTRICTIVENESS

Figure 2: Illustration of (a) disentanglement, (b) consistency, and (c) restrictiveness of $z_1$ with respect to the factor of variation size. Each image of a shape represents the decoding $g(z_{1:3})$ by the generative model. Each column denotes a fixed choice of $z_1$. Each row denotes a fixed choice of $(z_2, z_3)$.

A demonstration of consistency versus restrictiveness on models from disentanglement_lib is available in Appendix B. To ground our discussion of disentanglement, we consider an oracle that generates shapes with factors of variation for size ($S_1$), shape ($S_2$), and color ($S_3$). How can we determine whether $Z_1$ of our generative model disentangles the concept of size? Intuitively, one way to check whether $Z_1$ of the generative model disentangles size ($S_1$) is to visually inspect what happens as we vary $Z_1$, $Z_2$, and $Z_3$, and see whether the resulting visualizations are consistent with Figure 2a. In doing so, our visual inspection checks for two properties:

1. When $Z_1$ is fixed, the size ($S_1$) of the generated object never changes.
2. When only $Z_1$ is changed, the change is restricted to the size ($S_1$) of the generated object, meaning that there is no change in $S_j$ for $j \ne 1$.

We argue that disentanglement decomposes into these two properties, which we refer to as generator consistency and generator restrictiveness. Next, we formalize these two properties.

Let $\mathcal{H}$ be a hypothesis class of generative models from which we assume the true data-generating function is drawn. Each element of the hypothesis class $\mathcal{H}$ is a tuple $(p(s), g, e)$, where $p(s)$ describes the distribution over factors of variation, the generator $g$ is a function that maps from the factor space $\mathcal{S} \subseteq \mathbb{R}^n$ to the observation space $\mathcal{X} \subseteq \mathbb{R}^m$, and the encoder $e$ is a function that maps from $\mathcal{X}$ to $\mathcal{S}$. $S$ and $X$ can consist of both discrete and continuous random variables. We impose a few mild assumptions on $\mathcal{H}$ (see Appendix I.1). Notably, we assume every factor of variation is exactly recoverable from the observation $X$, i.e., $e(g(S)) = S$.

Given an oracle model $h^* = (p^*, g^*, e^*) \in \mathcal{H}$, we would like to learn a model $h = (p, g, e) \in \mathcal{H}$ whose latent variables disentangle the latent variables in $h^*$. We refer to the latent variables in the oracle $h^*$ as $S$ and the alternative model $h$'s latent variables as $Z$. If we further restrict $h$ to only those models where $g(Z) \overset{d}{=} g^*(S)$ are equal in distribution, it is natural to align $Z$ and $S$ via $S = e^* \circ g(Z)$. Under this relation between $Z$ and $S$, our goal is to construct definitions that describe whether the latent code $Z_i$ disentangles the corresponding factor $S_i$.

Generator Consistency. Let $I$ denote a set of indices and $p_I$ denote the generating process

$$z_I \sim p(z_I), \qquad (1)$$
$$z_{\setminus I},\, z'_{\setminus I} \overset{\text{iid}}{\sim} p(z_{\setminus I} \mid z_I). \qquad (2)$$

This generating process samples $Z_I$ once and then conditionally samples $Z_{\setminus I}$ twice in an i.i.d. fashion. We say that $Z_I$ is consistent with $S_I$ if

$$\mathbb{E}_{p_I}\left\| e^*_I \circ g(z_I, z_{\setminus I}) - e^*_I \circ g(z_I, z'_{\setminus I}) \right\|^2 = 0, \qquad (3)$$

where $e^*_I$ is the oracle encoder restricted to the indices $I$. Intuitively, Equation (3) states that, for any fixed choice of $Z_I$, resampling of $Z_{\setminus I}$ will not influence the oracle's measurement of the factors $S_I$. In other words, $S_I$ is invariant to changes in $Z_{\setminus I}$. An illustration of a generative model where $Z_1$ is consistent with size ($S_1$) is provided in Figure 2b. A notable property of our definition is that the prescribed sampling process $p_I$ does not require the underlying factors of variation to be statistically independent. We characterize this property in contrast to previous definitions of disentanglement in Appendix A. A sketch of a Monte Carlo check of this criterion on a toy model is given below.
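As a concrete illustration of Equation (3), the following sketch estimates the consistency criterion by Monte Carlo on a toy linear model. The linear map standing in for a learned generator $g$, its pseudo-inverse standing in for the oracle encoder $e^*$, and the factorized prior are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, x_dim = 3, 8
W = rng.normal(size=(n_factors, x_dim))
W_pinv = np.linalg.pinv(W)
g = lambda z: z @ W                  # stand-in learned generator g
e_star = lambda x: x @ W_pinv        # stand-in oracle encoder e*

def generator_consistency(I, n_samples=10000):
    """Monte Carlo estimate of the left-hand side of Equation (3)."""
    I = np.asarray(I)
    not_I = np.setdiff1d(np.arange(n_factors), I)
    # With a factorized prior, conditional resampling of the complement is a fresh draw.
    z = rng.normal(size=(n_samples, n_factors))
    z_prime = z.copy()
    z_prime[:, not_I] = rng.normal(size=(n_samples, len(not_I)))
    diff = e_star(g(z))[:, I] - e_star(g(z_prime))[:, I]
    return np.mean(np.sum(diff ** 2, axis=1))

print(generator_consistency(I=[0]))  # ~0: Z_0 is consistent with S_0 in this toy model
```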
Generator Restrictiveness. Let $p_{\setminus I}$ denote the generating process

$$z_{\setminus I} \sim p(z_{\setminus I}), \qquad (4)$$
$$z_I,\, z'_I \overset{\text{iid}}{\sim} p(z_I \mid z_{\setminus I}). \qquad (5)$$

We say that $Z_I$ is restricted to $S_I$ if

$$\mathbb{E}_{p_{\setminus I}}\left\| e^*_{\setminus I} \circ g(z_I, z_{\setminus I}) - e^*_{\setminus I} \circ g(z'_I, z_{\setminus I}) \right\|^2 = 0. \qquad (6)$$

Equation (6) states that, for any fixed choice of $Z_{\setminus I}$, resampling of $Z_I$ will not influence the oracle's measurement of the factors $S_{\setminus I}$. In other words, $S_{\setminus I}$ is invariant to changes in $Z_I$. Thus, changing $Z_I$ is restricted to modifying only $S_I$. An illustration of a generative model where $Z_1$ is restricted to size ($S_1$) is provided in Figure 2c.

Generator Disentanglement. We now say that $Z_I$ disentangles $S_I$ if $Z_I$ is consistent with and restricted to $S_I$. If we denote consistency and restrictiveness via Boolean functions $C(I)$ and $R(I)$, we can concisely state that

$$D(I) := C(I) \wedge R(I), \qquad (7)$$

where $D(I)$ denotes whether $Z_I$ disentangles $S_I$. An illustration of a generative model where $Z_1$ disentangles size ($S_1$) is provided in Figure 2a. Note that while size increases monotonically with $Z_1$ in the schematic figure, we wish to clarify that monotonicity is unrelated to the concepts of consistency and restrictiveness.

3.2 RELATION TO BIJECTIVITY-BASED DEFINITION OF DISENTANGLEMENT

Under our mild assumptions on $\mathcal{H}$, distribution matching $g(Z) \overset{d}{=} g^*(S)$ combined with generator disentanglement on factor $I$ implies the existence of two invertible functions $f_I$ and $f_{\setminus I}$ such that the alignment via $S = e^* \circ g(Z)$ decomposes into

$$\begin{pmatrix} S_I \\ S_{\setminus I} \end{pmatrix} = \begin{pmatrix} f_I(Z_I) \\ f_{\setminus I}(Z_{\setminus I}) \end{pmatrix}. \qquad (8)$$

This expression highlights the connection between disentanglement and invariance, whereby $S_I$ is only influenced by $Z_I$, and $S_{\setminus I}$ is only influenced by $Z_{\setminus I}$. However, such a bijectivity-based definition of disentanglement does not naturally expose the underlying primitives of consistency and restrictiveness, which we shall demonstrate in our theory and experiments to be valuable concepts for describing disentanglement guarantees under weak supervision.

3.3 ENCODER-BASED DEFINITIONS FOR DISENTANGLEMENT

Our proposed definitions are asymmetric, measuring the behavior of a generative model against an oracle encoder. So far, we have presented the definitions from the perspective of a learned generator $(p, g)$ measured against an oracle encoder $e^*$; in this sense, they are generator-based definitions. We can also develop a parallel set of definitions for encoder-based consistency, restrictiveness, and disentanglement within our framework simply by using an oracle generator $(p^*, g^*)$ measured against a learned encoder $e$. Below, we present the encoder-based perspective on consistency.

Encoder Consistency. Let $p^*_I$ denote the generating process

$$s_I \sim p^*(s_I), \qquad (9)$$
$$s_{\setminus I},\, s'_{\setminus I} \overset{\text{iid}}{\sim} p^*(s_{\setminus I} \mid s_I). \qquad (10)$$

This generating process samples $S_I$ once and then conditionally samples $S_{\setminus I}$ twice in an i.i.d. fashion. We say that $S_I$ is consistent with $Z_I$ if

$$\mathbb{E}_{p^*_I}\left\| e_I \circ g^*(s_I, s_{\setminus I}) - e_I \circ g^*(s_I, s'_{\setminus I}) \right\|^2 = 0. \qquad (11)$$

We now make two important observations. First, a valuable trait of our encoder-based definitions is that one can check for encoder consistency / restrictiveness / disentanglement as long as one has access to match-pairing data from the oracle generator. This is in contrast to existing disentanglement definitions and metrics, which require access to the ground truth factors (Higgins et al., 2017; Kumar et al., 2018; Kim & Mnih, 2018; Chen et al., 2018a; Suter et al., 2018; Ridgeway & Mozer, 2018; Eastwood & Williams, 2018). The ability to check for our definitions in a weakly supervised fashion is the key to why we can develop a theoretical framework using the language of consistency and restrictiveness.
Second, encoder-based definitions are tractable to measure when testing on synthetic data, since the synthetic data directly serves the role of the oracle generator. As such, while we develop our theory to guarantee both generator-based and encoder-based disentanglement, all of our measurements in the experiments are conducted with respect to a learned encoder.

We make three remarks on notation. First, $D(i) := D(\{i\})$. Second, $D(\emptyset)$ evaluates to true. Finally, $D(I)$ is implicitly dependent on either $(p, g, e^*)$ (generator-based) or $(p^*, g^*, e)$ (encoder-based). Where important, we shall make this dependency explicit (e.g., let $D(I; p, g, e^*)$ denote generator-based disentanglement). We apply these conventions to $C$ and $R$ analogously.

4 A CALCULUS OF DISENTANGLEMENT

There are several interesting relationships between restrictiveness and consistency. First, by definition, $C(I)$ is equivalent to $R(\setminus I)$. Second, we can see from Figures 2b and 2c that $C(I)$ and $R(I)$ do not imply each other. Based on these observations, and given that consistency and restrictiveness operate over subsets of the random variables, a natural question is whether consistency or restrictiveness over certain sets of variables implies additional properties over other sets of variables.

We develop a calculus for discovering implied relationships between learned latent variables $Z$ and ground truth factors of variation $S$ given known relationships, as follows.

Calculus of Disentanglement

Consistency and Restrictiveness:
- $C(I) \not\Rightarrow R(I)$
- $R(I) \not\Rightarrow C(I)$
- $C(I) \Leftrightarrow R(\setminus I)$

Union Rules:
- $C(I) \wedge C(J) \Rightarrow C(I \cup J)$
- $R(I) \wedge R(J) \Rightarrow R(I \cup J)$

Intersection Rules:
- $C(I) \wedge C(J) \Rightarrow C(I \cap J)$
- $R(I) \wedge R(J) \Rightarrow R(I \cap J)$

Full Disentanglement:
- $\bigwedge_{i=1}^n C(i) \Leftrightarrow \bigwedge_{i=1}^n D(i)$
- $\bigwedge_{i=1}^n R(i) \Leftrightarrow \bigwedge_{i=1}^n D(i)$

Our calculus provides a theoretically rigorous procedure for reasoning about disentanglement. In particular, it is no longer necessary to prove whether the supervision method of interest satisfies consistency and restrictiveness for each and every factor. Instead, it suffices to show that a supervision method guarantees consistency or restrictiveness for a subset of factors, and then to combine multiple supervision methods via the calculus to guarantee full disentanglement. We can additionally use the calculus to uncover consistency or restrictiveness on individual factors when weak supervision is available only for a subset of variables. For example, achieving consistency on $S_{\{1,2\}}$ and $S_{\{2,3\}}$ implies consistency on the intersection $S_2$. Furthermore, we note that these rules are agnostic to whether generator-based or encoder-based definitions are used. We defer the complete proofs to Appendix I.2. A small sketch of how these rules can be applied mechanically is given below.
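To illustrate how the calculus can be applied mechanically, the sketch below propagates known consistency/restrictiveness facts to a fixed point using the complement, union, and intersection rules. The representation of facts as frozensets, the six-factor universe, and all function names are our own illustrative choices, not an implementation from the paper.

```python
from itertools import combinations

ALL = frozenset(range(6))   # hypothetical: six factors of variation, as in Shapes3D

def closure(consistent, restricted, universe=ALL):
    """Propagate known C/R facts with the complement, union, and intersection rules."""
    C, R = set(map(frozenset, consistent)), set(map(frozenset, restricted))
    changed = True
    while changed:
        changed = False
        new_C = {universe - r for r in R}             # C(I) <=> R(\I)
        new_R = {universe - c for c in C}
        for a, b in combinations(C | new_C, 2):
            new_C |= {a | b, a & b}                   # union / intersection rules
        for a, b in combinations(R | new_R, 2):
            new_R |= {a | b, a & b}
        if not (new_C <= C and new_R <= R):
            C, R, changed = C | new_C, R | new_R, True
    return C, R

def fully_disentangled(consistent, restricted, universe=ALL):
    # Full-disentanglement rule: C(i) for every factor i (or R(i) for every i)
    # is equivalent to D(i) for every i.
    C, R = closure(consistent, restricted, universe)
    singles = {frozenset({i}) for i in universe}
    return singles <= C or singles <= R

# Share pairing on each factor separately guarantees C({i}) for every i,
# which by the calculus already implies full disentanglement.
print(fully_disentangled(consistent=[{i} for i in ALL], restricted=[]))   # True

# Change pairing on {0,1} and on {1,2} guarantees R({0,1}) and R({1,2});
# the intersection rule then yields single-factor restrictiveness R({1}).
C, R = closure(consistent=[], restricted=[{0, 1}, {1, 2}])
print(frozenset({1}) in R)                                                # True
```

The closure terminates because the universe of index sets is finite and the sets of known facts only grow.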
5 FORMALIZING WEAK SUPERVISION WITH GUARANTEES

In this section, we address the question of whether disentanglement arises from the supervision method or from model inductive bias. This challenge was first put forth by Locatello et al. (2019), who noted that unsupervised disentanglement is heavily reliant on model inductive bias. As we transition toward supervised approaches, it is crucial that we formalize what it means for disentanglement to be guaranteed by weak supervision.

Sufficiency for Disentanglement. Let $\mathcal{P}$ denote a family of augmented distributions. We say that a weak supervision method $\mathcal{S}: \mathcal{H} \to \mathcal{P}$ is sufficient for learning a generator whose latent codes $Z_I$ disentangle the factors $S_I$ if there exists a learning algorithm $\mathcal{A}: \mathcal{P} \to \mathcal{H}$ such that for any choice of $(p^*(s), g^*, e^*) \in \mathcal{H}$, the procedure $\mathcal{A} \circ \mathcal{S}(p^*(s), g^*, e^*)$ returns a model $(p(z), g, e)$ for which both $D(I; p, g, e^*)$ and $D(I; p^*, g^*, e)$ hold, and $g(Z) \overset{d}{=} g^*(S)$.

The key insight of this definition is that we force the strategy and learning algorithm pair $(\mathcal{S}, \mathcal{A})$ to handle all possible oracles drawn from the hypothesis class $\mathcal{H}$. This prevents the exploitation of model inductive bias, since any bias from the learning algorithm $\mathcal{A}$ toward a reduced hypothesis class $\hat{\mathcal{H}} \subset \mathcal{H}$ will result in failure to handle oracles in the complementary hypothesis class $\mathcal{H} \setminus \hat{\mathcal{H}}$. The distribution-matching requirement $g(Z) \overset{d}{=} g^*(S)$ ensures latent code informativeness, i.e., it prevents trivial solutions where the latent code is uninformative (see Proposition 6 for the formal statement). Intuitively, distribution matching paired with a deterministic generator guarantees invertibility of the learned generator and encoder, enforcing that $Z_I$ cannot encode less information than $S_I$ (e.g., only encoding age group instead of numerical age) and vice versa.

6 ANALYSIS OF WEAK SUPERVISION METHODS

We now apply our theoretical framework to three practical weak supervision methods: restricted labeling, match pairing, and rank pairing. Our main theoretical findings are that: (1) these methods can be applied in a targeted manner to provide single-factor consistency or restrictiveness guarantees; and (2) by enforcing consistency (or restrictiveness) on all factors, we can learn models with strong disentanglement performance. Correspondingly, Figure 3 and Figure 5 are our main experimental results, demonstrating that these theoretical guarantees have predictive power in practice.

6.1 THEORETICAL GUARANTEES FROM WEAK SUPERVISION

We prove that if a training algorithm successfully matches the generated distribution to the data distribution generated via restricted labeling, match pairing, or rank pairing of factors $S_I$, then $Z_I$ is guaranteed to be consistent with $S_I$:

Theorem 1. Given any oracle $(p^*(s), g^*, e^*) \in \mathcal{H}$, consider the distribution-matching algorithm $\mathcal{A}$ that selects a model $(p(z), g, e) \in \mathcal{H}$ such that:

1. $(g^*(S), S_I) \overset{d}{=} (g(Z), Z_I)$ (restricted labeling); or
2. $\big(g^*(S_I, S_{\setminus I}),\, g^*(S_I, S'_{\setminus I})\big) \overset{d}{=} \big(g(Z_I, Z_{\setminus I}),\, g(Z_I, Z'_{\setminus I})\big)$ (match pairing); or
3. $\big(g^*(S), g^*(S'), \mathbb{1}\{S_I \ge S'_I\}\big) \overset{d}{=} \big(g(Z), g(Z'), \mathbb{1}\{Z_I \ge Z'_I\}\big)$ (rank pairing).

Then $(p, g)$ satisfies $C(I; p, g, e^*)$ and $e$ satisfies $C(I; p^*, g^*, e)$.

Theorem 1 states that distribution matching under restricted labeling, match pairing, or rank pairing of $S_I$ guarantees both generator and encoder consistency for the learned generator and encoder, respectively. While the complement rule $C(I) \Leftrightarrow R(\setminus I)$ further guarantees that $Z_{\setminus I}$ is restricted to $S_{\setminus I}$, we can prove that the same supervision does not guarantee that $Z_I$ is restricted to $S_I$ (Theorem 2). However, if we additionally have restricted labeling or match pairing for $S_{\setminus I}$, then we can see from the calculus that $R(I) \wedge C(I)$ is guaranteed, thus implying disentanglement of factor $I$. We also note that while restricted labeling and match pairing can be applied to a set of factors at once (i.e., $|I| \ge 1$), rank pairing is restricted to one-dimensional factors for which an ordering exists. In the experiments below, we empirically verify the theoretical guarantees provided in Theorem 1. A sketch of one possible instantiation of the match-pairing objective is given below.
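The paper trains GANs to match these augmented distributions (and notes that any distribution-matching algorithm could be used). The snippet below is one possible minimal instantiation of the match-pairing objective, not the authors' released implementation: a discriminator scores concatenated pairs, where real pairs share the ground-truth factors in $I$ and generated pairs share the latent codes in $I$. The toy linear oracle, network sizes, and index set are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_factors, x_dim, I = 4, 16, [0, 1]          # hypothetical sizes; I = shared factor indices

# Toy oracle generator g* (stands in for the real data-generating process).
W_star = torch.randn(n_factors, x_dim)
def oracle_pairs(batch):
    s, s2 = torch.randn(batch, n_factors), torch.randn(batch, n_factors)
    s2[:, I] = s[:, I]                       # real pair shares factors S_I
    return s @ W_star, s2 @ W_star

# Learned generator g and pair discriminator D.
g = nn.Sequential(nn.Linear(n_factors, 64), nn.ReLU(), nn.Linear(64, x_dim))
d = nn.Sequential(nn.Linear(2 * x_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def generator_pairs(batch):
    z, z2 = torch.randn(batch, n_factors), torch.randn(batch, n_factors)
    z2[:, I] = z[:, I]                       # generated pair shares latent codes Z_I
    return g(z), g(z2)

for _ in range(200):                         # tiny illustrative training loop
    batch = 128
    x_real = torch.cat(oracle_pairs(batch), dim=1)
    x_fake = torch.cat(generator_pairs(batch), dim=1)

    # Discriminator: real pairs -> 1, generated pairs -> 0.
    loss_d = bce(d(x_real), torch.ones(batch, 1)) + \
             bce(d(x_fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the pair discriminator.
    loss_g = bce(d(torch.cat(generator_pairs(batch), dim=1)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The essential design choice is that the discriminator sees pairs rather than single observations, so the generator can only match the augmented distribution by reusing $Z_I$ across the pair in the same way the oracle reuses $S_I$.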
6.2 EXPERIMENTS

We conducted experiments on five prominent datasets in the disentanglement literature: Shapes3D (Kim & Mnih, 2018), dSprites (Higgins et al., 2017), Scream-dSprites (Locatello et al., 2019), SmallNORB (LeCun et al., 2004), and Cars3D (Reed et al., 2015). Since some of the underlying factors are treated as nuisance variables in SmallNORB and Scream-dSprites, we show in Appendix C that our theoretical framework can be easily adapted to handle such situations. We use generative adversarial networks (GANs; Goodfellow et al., 2014) for learning $(p, g)$, but any distribution-matching algorithm (e.g., maximum likelihood training in tractable models, or variational inference in latent-variable models) could be applied. Our results are collected over a broad range of hyperparameter configurations (see Appendix H for details). Since existing quantitative metrics of disentanglement all measure the performance of an encoder with respect to the true data generator, we trained an encoder post hoc to approximately invert the learned generator, and measured all quantitative metrics (e.g., mutual information gap) on the encoder. Our theory assumes that the learned generator is invertible. While this is not true for conventional GANs, our empirical results show that this is not an issue in practice (see Appendix G).

We present three sets of experimental results: (1) single-factor experiments, where we show that our theory can be applied in a targeted fashion to guarantee consistency or restrictiveness of a single factor; (2) consistency versus restrictiveness experiments, where we show the extent to which single-factor consistency and restrictiveness are correlated even when the models are only trained to maximize one or the other; and (3) full disentanglement experiments, where we apply our theory to fully disentangle all factors. A more extensive set of experiments can be found in the Appendix.

6.2.1 SINGLE-FACTOR CONSISTENCY AND RESTRICTIVENESS

We empirically verify that single-factor consistency or restrictiveness can be achieved with the supervision methods of interest.
Note that there are two special cases of match pairing: one where $S_i$ is the only factor that is shared between $x$ and $x'$, and one where $S_i$ is the only factor that is changed. We distinguish these two conditions as share pairing and change pairing, respectively. Theorem 1 shows that restricted labeling, share pairing, and rank pairing of the $i$-th factor are each sufficient supervision strategies for guaranteeing consistency on $S_i$. Change pairing at $S_i$ is equivalent to share pairing at $S_{\setminus i}$; the complement rule $C(I) \Leftrightarrow R(\setminus I)$ allows us to conclude that change pairing guarantees restrictiveness.

Figure 3: Heatmap visualization of ablation studies that measure either single-factor consistency or single-factor restrictiveness as a function of various supervision methods (restricted labeling, share pairing, change pairing, rank pairing, and change-pair intersection at factor i), conducted on Shapes3D. Our theory predicts the diagonal components to achieve the highest scores. Note that share pairing, change pairing, and change pair intersection are special cases of match pairing.

The first four heatmaps in Figure 3 show the results for restricted labeling, share pairing, change pairing, and rank pairing. The numbers shown in the heatmaps are the normalized consistency and restrictiveness scores. We define the normalized consistency score as

$$c(I; p^*, g^*, e) = 1 - \frac{\mathbb{E}_{p^*_I}\left\| e_I \circ g^*(s_I, s_{\setminus I}) - e_I \circ g^*(s_I, s'_{\setminus I}) \right\|^2}{\mathbb{E}_{s, s' \overset{\text{iid}}{\sim} p^*}\left\| e_I \circ g^*(s) - e_I \circ g^*(s') \right\|^2}. \qquad (12)$$

This score is bounded on the interval $[0, 1]$ (a consequence of Lemma 1) and is maximal when $C(I; p^*, g^*, e)$ is satisfied. This normalization procedure is similar in spirit to the Interventional Robustness score in Suter et al. (2018). The normalized restrictiveness score $r$ can be defined analogously. In practice, we estimate this score via Monte Carlo estimation; a sketch of such an estimator is given below.
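The following is a minimal sketch of such a Monte Carlo estimator for Equation (12), mirroring the generator-side sketch in Section 3. The toy linear oracle, the pseudo-inverse standing in for a learned encoder, and the factorized prior are illustrative assumptions, not the paper's evaluation pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, x_dim = 6, 32
W = rng.normal(size=(n_factors, x_dim))

sample_s = lambda n: rng.normal(size=(n, n_factors))   # stand-in for p*(s)
g_star = lambda s: s @ W                                # stand-in oracle generator g*
encoder = lambda x: x @ np.linalg.pinv(W)               # stand-in learned encoder e

def normalized_consistency(I, n_samples=20000):
    """Monte Carlo estimate of the normalized consistency score c(I; p*, g*, e)."""
    I = np.asarray(I)
    not_I = np.setdiff1d(np.arange(n_factors), I)

    # Numerator: resample the complement factors while holding s_I fixed (share pairing).
    s = sample_s(n_samples)
    s_pair = s.copy()
    s_pair[:, not_I] = sample_s(n_samples)[:, not_I]    # fresh draws (factorized prior)
    num = np.mean(np.sum((encoder(g_star(s))[:, I] -
                          encoder(g_star(s_pair))[:, I]) ** 2, axis=1))

    # Denominator: fully independent samples as the normalization baseline.
    s1, s2 = sample_s(n_samples), sample_s(n_samples)
    den = np.mean(np.sum((encoder(g_star(s1))[:, I] -
                          encoder(g_star(s2))[:, I]) ** 2, axis=1))
    return 1.0 - num / den

print(normalized_consistency(I=[0]))   # close to 1 for this perfectly invertible toy model
```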
The final heatmap in Figure 3 demonstrates the calculus of intersection. In practice, it may be easier to acquire paired data where multiple factors change simultaneously. If we have access to two kinds of datasets, one where $S_I$ are changed and one where $S_J$ are changed, our calculus predicts that training on both datasets will guarantee restrictiveness on $S_{I \cap J}$. The final heatmap shows six such intersection settings and measures the normalized restrictiveness score; in all but one setting, the results are consistent with our theory. We show in Figure 7 that this inconsistency is attributable to the failure of the GAN to distribution-match due to sensitivity to a specific hyperparameter.

6.2.2 CONSISTENCY VERSUS RESTRICTIVENESS

Figure 4: Correlation plot and scatterplots demonstrating the empirical relationship between $c(i)$ and $r(i)$ across all 864 models trained on Shapes3D.

We now determine the extent to which consistency and restrictiveness are correlated in practice. In Figure 4, we collected all 864 Shapes3D models that we trained in Section 6.2.1 and measured the consistency and restrictiveness of each model on each factor, providing both the correlation plot and scatterplots of $c(i)$ versus $r(i)$. Since the models trained in Section 6.2.1 only ever targeted the consistency or restrictiveness of a single factor, and since our calculus shows that consistency and restrictiveness do not imply each other, one might a priori expect to find no correlation in Figure 4. Our results show that the correlation is actually quite strong. Since this correlation is not guaranteed by our choice of weak supervision, it is necessarily a consequence of model inductive bias. We believe this correlation between consistency and restrictiveness has been a general source of confusion in the disentanglement literature, causing many to either observe or believe that restricted labeling or share pairing on $S_i$ (which only guarantees consistency) is sufficient for disentangling $S_i$ (Kingma et al., 2014; Chen & Batmanghelich, 2019; Gabbay & Hoshen, 2019; Narayanaswamy et al., 2017). It remains an open question why consistency and restrictiveness are so strongly correlated when training existing models on real-world data.

6.2.3 FULL DISENTANGLEMENT

Figure 5: Disentanglement performance of a vanilla GAN, share-pairing GAN, change-pairing GAN, rank-pairing GAN, and fully-labeled GAN, as measured by the mutual information gap across several datasets (Shapes3D, dSprites, Scream-dSprites, SmallNORB, and Cars3D). A comprehensive set of performance evaluations on existing disentanglement metrics is available in Figure 13.
If we have access to share-, change-, or rank-pairing data for each factor, our calculus states that it is possible to guarantee full disentanglement. We trained our generative model on either complete share pairing, complete change pairing, or complete rank pairing, and measured disentanglement performance via the discretized mutual information gap (Chen et al., 2018a; Locatello et al., 2019). As negative and positive controls, we also show the performance of an unsupervised GAN and a fully-supervised GAN where the latents are fixed to the ground truth factors of variation. Our results in Figure 5 empirically verify that combining single-factor weak supervision datasets leads to consistently high disentanglement scores.

7 CONCLUSION

In this work, we construct a theoretical framework to rigorously analyze the disentanglement guarantees of weak supervision algorithms. Our paper clarifies several important concepts, such as consistency and restrictiveness, that have hitherto been confused or overlooked in the existing literature, and provides a formalism that precisely distinguishes when disentanglement arises from supervision versus model inductive bias. Through our theory and a comprehensive set of experiments, we demonstrated the conditions under which various supervision strategies guarantee disentanglement. Our work establishes several promising directions for future research. First, we hope that our formalism and experiments inspire greater theoretical and scientific scrutiny of the inductive biases present in existing models. Second, we encourage the search for other learning algorithms (besides distribution matching) that may have theoretical guarantees when paired with the right form of supervision. Finally, we hope that our framework enables the theoretical analysis of other promising weak supervision methods.

ACKNOWLEDGMENTS

We would like to thank James Brofos and Honglin Yuan for their insightful discussions on the theoretical analysis in this paper, and Aditya Grover and Hung H. Bui for their helpful feedback during the course of this project.

REFERENCES

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.

Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Junxiang Chen and Kayhan Batmanghelich. Weakly supervised disentanglement by pairwise similarities. arXiv preprint arXiv:1906.01044, 2019.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610-2620, 2018a.

Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, et al. Sample efficient adaptive text-to-speech. arXiv preprint arXiv:1809.10460, 2018b.

Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In ICLR, 2018.

Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, Narayanaswamy Siddharth, Brooks Paige, Dana H. Brooks, Jennifer Dy, and Jan-Willem van de Meent. Structured disentangled representations. arXiv preprint arXiv:1804.02086, 2018.

Aviv Gabbay and Yedid Hoshen. Latent optimization for non-adversarial representation disentanglement. arXiv preprint arXiv:1906.11796, 2019.
Leilani H. Gilpin, David Bau, Ben Z. Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pp. 80-89. IEEE, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Luigi Gresele, Paul K. Rubenstein, Arash Mehrjou, Francesco Locatello, and Bernhard Schölkopf. The incomplete Rosetta Stone problem: Identifiability results for multi-view nonlinear ICA. arXiv preprint arXiv:1905.06642, 2019.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.

Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581-3589, 2014.

Jack Klys, Jake Snell, and Richard Zemel. Learning latent subspaces in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 6444-6454, 2018.

Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539-2547, 2015.

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.

Yann LeCun, Fu Jie Huang, Leon Bottou, et al. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pp. 97-104, 2004.

Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang, Yu-Ding Lu, Maneesh Singh, and Ming-Hsuan Yang. DRIT++: Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270, 2019.

Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, 2019.

Brian McFee and Gert R. Lanckriet. Metric learning to rank. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 775-782, 2010.

Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.

Siddharth Narayanaswamy, T. Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, pp. 5925-5935, 2017.

Scott E. Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, pp. 1252-1260, 2015.

Karl Ridgeway and Michael C. Mozer. Learning deep disentangled embeddings with the F-statistic loss. In Advances in Neural Information Processing Systems, pp. 185-194, 2018.
Raphael Suter, Dorde Miladinovic, Stefan Bauer, and Bernhard Schölkopf. Interventional robustness of deep latent variable models. In ICML, 2018.

Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386-1393, 2014.

Our appendix consists of nine sections. We provide a brief summary of each section below.

Appendix A: We elaborate on the connections between existing definitions of disentanglement and our definitions of consistency / restrictiveness / disentanglement. In particular, we highlight three notable properties of our definitions not present in many existing definitions.

Appendix B: We evaluate our consistency and restrictiveness metrics on the 10800 models in disentanglement_lib, and identify models where consistency and restrictiveness are not correlated.

Appendix C: We adapt our definitions to handle nuisance variables. We do so through a simple modification of the definition of restrictiveness.

Appendix D: We show several additional single-factor experiments. We first address one of the results in the main text that is not consistent with our theory, and explain why it can be attributed to hyperparameter sensitivity. We then unwrap the heatmaps into more informative boxplots.

Appendix E: We provide an additional suite of consistency versus restrictiveness experiments by comparing the effects of training with share pairing (which guarantees consistency), change pairing (which guarantees restrictiveness), and both.

Appendix F: We provide full disentanglement results on all five datasets as measured according to six different metrics of disentanglement found in the literature.

Appendix G: We show visualizations of a weakly supervised generative model trained to achieve full disentanglement.

Appendix H: We describe the set of hyperparameter configurations used in all our experiments.

Appendix I: We provide the complete set of assumptions and proofs for our theoretical framework.

A CONNECTIONS TO EXISTING DEFINITIONS

Numerous definitions of disentanglement are present in the literature (Higgins et al., 2017; 2018; Kim & Mnih, 2018; Suter et al., 2018; Ridgeway & Mozer, 2018; Eastwood & Williams, 2018; Chen et al., 2018a). We mostly defer to the terminology suggested by Ridgeway & Mozer (2018), which decomposes disentanglement into modularity, compactness, and explicitness. Modularity means a latent code $Z_i$ is predictive of at most one factor of variation $S_j$. Compactness means a factor of variation $S_i$ is predicted by at most one latent code $Z_j$. And explicitness means a factor of variation $S_j$ is predicted by the latent codes via a simple transformation (e.g., linear). Similar to Eastwood & Williams (2018) and Higgins et al. (2018), we suggest a further decomposition of Ridgeway & Mozer (2018)'s explicitness into latent code informativeness and latent code simplicity. In this paper, we omit latent code simplicity from consideration. Since informativeness of the latent code is already enforced by our requirement that $g(Z)$ is equal in distribution to $g^*(S)$ (see Proposition 6), we focus on comparing our proposed concepts of consistency and restrictiveness to modularity and compactness.
We make note of three important distinctions.

Restrictiveness is not synonymous with either modularity or compactness. In Figure 2c, it is evident that the factor of variation size is not predictable from any individual $Z_i$ (conversely, $Z_1$ is not predictable from any individual factor $S_i$). As such, $Z_1$ is neither a modular nor a compact representation of size, despite being restricted to size. To our knowledge, no existing quantitative definition of disentanglement (or its decomposition) specifically measures restrictiveness.

Consistency and restrictiveness are invariant to statistically dependent factors of variation. Many existing definitions of disentanglement are instantiated by measuring the mutual information between $Z$ and $S$. For example, Ridgeway & Mozer (2018) define a latent code $Z_i$ to be ideally modular if it has high mutual information with a single factor $S_j$ and zero mutual information with all other factors $S_{\setminus j}$. This presents an issue when the true factors of variation are themselves statistically dependent; even if $Z_1 = S_1$, the latent code $Z_1$ would violate modularity if $S_1$ itself has positive mutual information with $S_2$. Consistency and restrictiveness circumvent this issue by relying on conditional resampling. Consistency, for example, only measures the extent to which $S_I$ is invariant to resampling of $Z_{\setminus I}$ when conditioned on $Z_I$, and is thus achieved as long as $s_I$ is a function of only $z_I$, irrespective of whether $s_I$ and $s_{\setminus I}$ are statistically dependent.
6b, we see that varying the factor along each row results in changes to object color but to no other attributes. However if we look across columns, we see that the representation of color changes depending on the setting of other factors, thus this factor is not consistent for object color. C HANDLING NUISANCE VARIABLES Our theoretical framework can handle nuisance variables, i.e., variables we cannot measure or perform weak supervision on. It may be impossible to label, or provide match-pairing on that factor of variation. For example, while many features of an image are measurable (such as brightness and Published as a conference paper at ICLR 2020 coloration), we may not be able to measure certain factors of variation or generate data pairs where these factors are kept constant. In this case, we can let one additional variable η act as nuisance variable that captures all additional sources of variation / stochasticity. Formally, suppose the full set of true factors is S {η} Rn+1. We define η-consistency Cη(I) = C(I) and η-restrictiveness Rη(I) = R(I {η}). This captures our intuition that, with nuisance variable, for consistency, we still want changes to Z\I {η} to not modify SI; for restrictiveness, we want changes to ZI {η} to only modify SI {η}. We define η-disentanglement as Dη(I) = Cη(I) Rη(I). All of our calculus still holds where we substitute Cη(I), Rη(I), Dη(I) for C(I), R(I), D(I); we prove one of the new full disentanglement rule as an illustration: Proposition 1. Vn i=1 Cη(i) Vn i=1 Dη(i). Proof. On the one hand, Vn i=1 Cη(i) Vn i=1 C(i) = C(1 : n) = R(η). On the other hand, Vn i=1 C(i) = Vn i=1 D(i) = Vn i=1 R(i). Therefore LHS = i [n], R(i) R(η) = Rη(i). The reverse direction is trivial. In (Locatello et al., 2019), the instance factor in Small NORB and the background image factor in Scream-d Sprites are treated as nuisance variables. By Proposition 1, as long as we perform weak supervision on all of the non-nuisance variables (via sharing-pairing, say) to guarantee their consistency with respect to the corresponding true factor of variation, we still have guaranteed full disentanglement despite the existence of nuisance variable and the fact that we cannot measure or perform weak supervision on nuisance variable. 
D SINGLE-FACTOR EXPERIMENTS

Figure 7: Heatmaps of single-factor normalized consistency / restrictiveness scores under restricted labeling, share pairing, change pairing, rank pairing, and change-pair intersection on Shapes3D. This is the same plot as Figure 3, but where we restrict our hyperparameter sweep to always set extra_dense = False. See Appendix H for details about the hyperparameter sweep.

Figure 8: Restricted labeling guarantees consistency. Each plot shows the normalized consistency score of each model for each factor of variation. Our theory predicts each boxplot highlighted in red to achieve the highest consistency. Due to the prevalence of restricted labeling in the existing literature, we chose to only conduct the single-factor restricted labeling experiment on Shapes3D.
Figure 9: Change pairing guarantees restrictiveness. Each plot shows the normalized restrictiveness score of each model for each factor of variation (row) across different datasets (columns). Different colors indicate models trained with change pairing on different factors. The appropriately-supervised model for each factor is marked in red.
Figure 10: Share pairing guarantees consistency. Each plot shows the normalized consistency score of each model for each factor of variation (row) across different datasets (columns). Different colors indicate models trained with share pairing on different factors. The appropriately-supervised model for each factor is marked in red.
[Figure 11 panels: normalized consistency score for each factor (y-axis, "Normalized Consistency Score for Factor k") plotted against the factor used for rank-pairing supervision (x-axis, "Rank Pairing at Factor i"), one panel per factor and dataset.]
Figure 11: Rank pairing guarantees consistency. Each plot shows the normalized consistency score of each model for each factor of variation (row) across different datasets (columns). Different colors indicate models trained with rank pairing on different factors. The appropriately-supervised model for each factor is marked in red.
E CONSISTENCY VERSUS RESTRICTIVENESS

[Figure 12 panels: scatterplots of normalized consistency score (y-axis) versus normalized restrictiveness score (x-axis) for each factor and dataset; legend: Change, Share, Both.]
Figure 12: Normalized consistency vs. restrictiveness score of different models on each factor (row) across different datasets (columns).
In many of the plots, we see that models trained via change pairing (blue) achieve higher restrictiveness; models trained via share pairing (orange) achieve higher consistency; and models trained with both techniques (green) simultaneously achieve restrictiveness and consistency in most cases.

F FULL DISENTANGLEMENT EXPERIMENTS

[Figure 13 panels: BetaVAE Score, FactorVAE Score, and Mutual Information Gap (rows) for each dataset (columns); x-axis "Supervision Method" with conditions None, Share, Change, Rank, Full-Label.]
Figure 13: Disentanglement performance of a vanilla GAN, share pairing GAN, change pairing GAN, rank pairing GAN, and fully-labeled GAN, as measured by multiple disentanglement metrics from the existing literature (rows) across multiple datasets (columns). According to almost all metrics, our weakly supervised models surpass the baseline and, in some cases, even outperform the fully-labeled model.

[Figure 14 panels: normalized consistency score at each factor (rows) for each dataset (columns); conditions None, Share, Change, Rank, Full-Label.]
Figure 14: Performance of a vanilla GAN (blue), share pairing GAN (orange), change pairing GAN (green), rank pairing GAN (red), and fully-labeled GAN (purple), as measured by the normalized consistency score of each factor (rows) across multiple datasets (columns). Factors {3, 4, 5} in the first column show that distribution matching to all six change / share pairing datasets is particularly challenging for the models when trained on certain hyperparameter choices.
However, since consistency and restrictiveness can be measured in weakly supervised settings, it suffices to use these metrics for hyperparameter selection. We see in Figure 16 and Appendix G that using consistency and restrictiveness for hyperparameter selection serves as a viable weakly-supervised surrogate for existing fully-supervised disentanglement metrics.

[Figure 15 panels: normalized restrictiveness score at each factor (rows) for each dataset (columns); conditions None, Share, Change, Rank, Full-Label.]
Figure 15: Performance of a vanilla GAN (blue), share pairing GAN (orange), change pairing GAN (green), rank pairing GAN (red), and fully-labeled GAN (purple), as measured by the normalized restrictiveness score of each factor (rows) across multiple datasets (columns). Since restrictiveness and consistency are complementary, we see that the anomalies in Figure 14 are reflected in the complementary factors in this figure.

[Figure 16 panels: one scatterplot per dataset ((a) Shapes3D, (b) dSprites, (c) Scream-dSprites, (d) Small NORB, ...), plotting BetaVAE Score, FactorVAE Score, and Mutual Information Gap against average normalized consistency and average normalized restrictiveness.]
Figure 16: Scatterplot of existing disentanglement metrics versus average normalized consistency and restrictiveness. Whereas existing disentanglement metrics are fully-supervised, it is possible to measure average normalized consistency and restrictiveness with weakly supervised data (share pairing and match pairing, respectively), making it viable to perform hyperparameter tuning under weakly supervised conditions.
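To make the weakly supervised model-selection recipe concrete, below is a minimal Monte Carlo sketch (ours, in numpy) of a consistency-style score for a single latent coordinate: fix z_i, resample the remaining latents, and measure how much the i-th encoder output moves, normalized by its spread over unpaired samples. The functions `g`, `e`, and `prior_sample`, and the naive variance normalizer, are illustrative assumptions; the exact normalized score reported in the figures follows the definition in the main text.

```python
import numpy as np

def consistency_score(g, e, i, prior_sample, n_outer=256, n_inner=2):
    """Monte Carlo estimate of a consistency-style score for latent coordinate i.

    g: generator, maps a latent vector z to an observation x.
    e: encoder, maps an observation x back to a latent vector.
    prior_sample(k): returns a writable (k, z_dim) numpy array drawn from p(z).
    If coordinate i is consistent, sharing z_i while resampling z_{\\i}
    should not move the encoder's i-th output.
    """
    diffs, spread = [], []
    for _ in range(n_outer):
        z = prior_sample(n_inner)              # (n_inner, z_dim)
        z[:, i] = z[0, i]                      # share coordinate i across the group
        ei = np.array([e(g(zk))[i] for zk in z])
        diffs.append(np.var(ei))               # within-group disagreement on factor i
        spread.append(np.var([e(g(zk))[i] for zk in prior_sample(n_inner)]))
    # 1 means perfectly consistent; 0 means as inconsistent as unpaired samples.
    return 1.0 - np.mean(diffs) / (np.mean(spread) + 1e-8)
```

Averaging such a score over all coordinates (together with an analogous restrictiveness-style score computed from change-paired samples) yields a selection criterion that requires no ground-truth factor labels.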
G FULL DISENTANGLEMENT VISUALIZATIONS

As a demonstration of the weakly-supervised generative models, we visualize our best-performing match-pairing generative models (as selected according to the normalized consistency score averaged across all the factors). Recall from Figures 2a to 2c that, to visually check for consistency and restrictiveness, it is important that we not only ablate a single factor (across the column), but also show that the factor stays consistent (down the row). Each block of 3 × 12 images in Figures 17 to 21 checks for disentanglement of the corresponding factor. Each row is constructed by randomly sampling Z\i and then ablating Zi.

Figure 17: Cars3D. Ground truth factors: elevation, azimuth, object type.
Figure 18: Shapes3D. Ground truth factors: floor color, wall color, object color, object size, object type, and azimuth.
Figure 19: dSprites. Ground truth factors: shape, scale, orientation, X-position, Y-position.
Figure 20: Scream-dSprites. Ground truth factors: shape, scale, orientation, X-position, Y-position.
Figure 21: Small NORB. Ground truth factors: category, elevation, azimuth, lighting condition.

H HYPERPARAMETERS

Table 1: We trained a probabilistic Gaussian encoder to approximately invert the generative model. The encoder is not trained jointly with the generator, but instead trained separately from the generative model (i.e., the encoder gradient does not backpropagate to the generative model). During training, the encoder is only exposed to data generated by the learned generative model.
4 × 4 spectral norm conv. 32. lReLU
4 × 4 spectral norm conv. 32. lReLU
4 × 4 spectral norm conv. 64. lReLU
4 × 4 spectral norm conv. 64. lReLU
flatten
128 spectral norm dense. lReLU
2 × z-dim spectral norm dense

Table 2: Generative model architecture.
128 dense. ReLU. batchnorm.
1024 dense. ReLU. batchnorm.
4 × 4 × 64 reshape.
4 × 4 conv. 64. lReLU. batchnorm.
4 × 4 conv. 32. lReLU. batchnorm.
4 × 4 conv. 32. lReLU. batchnorm.
4 × 4 conv. 3. sigmoid

Table 3: Discriminator used for restricted labeling. Parts in red are part of the hyperparameter search.
Discriminator Body
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
flatten
If extra dense: 128 × width spectral norm dense. lReLU
Discriminator Auxiliary Channel for Label
128 × width spectral norm dense. lReLU
If extra dense: 128 × width spectral norm dense. lReLU
Discriminator Head
concatenate body and auxiliary.
128 × width spectral norm dense. lReLU
128 × width spectral norm dense. lReLU
1 spectral norm dense with bias.

Table 4: Discriminator used for match pairing. We use a projection discriminator (Miyato & Koyama, 2018) and thus have an unconditional and a conditional head. Parts in red are part of the hyperparameter search.
Discriminator Body (applied separately to x and x′)
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
flatten
If extra dense: 128 × width spectral norm dense. lReLU
concatenate the pair.
128 × width spectral norm dense. lReLU
128 × width spectral norm dense. lReLU
Unconditional Head
1 spectral norm dense with bias
Conditional Head
128 × width spectral norm dense
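As a concrete reading of Table 3, here is a minimal PyTorch-style sketch (ours) of the restricted-labeling discriminator: a spectral-norm convolutional body for the image, an auxiliary dense channel for the observed labels s_I, and a joint head producing a single logit. The class name, strides, padding, and the 64 × 64 input size are illustrative assumptions; only the layer widths, the width multiplier, and the optional extra dense layer mirror the table.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(c_in, c_out):
    # 4x4 spectral-norm convolution; stride 2 and padding 1 are assumed defaults.
    return spectral_norm(nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1))

class RestrictedLabelingDiscriminator(nn.Module):
    """Sketch of the Table 3 discriminator: image body + label channel + joint head."""

    def __init__(self, img_channels=3, img_size=64, label_dim=2, width=1, extra_dense=False):
        super().__init__()
        w = width
        body = [
            sn_conv(img_channels, 32 * w), nn.LeakyReLU(0.2),
            sn_conv(32 * w, 32 * w), nn.LeakyReLU(0.2),
            sn_conv(32 * w, 64 * w), nn.LeakyReLU(0.2),
            sn_conv(64 * w, 64 * w), nn.LeakyReLU(0.2),
            nn.Flatten(),
        ]
        feat = 64 * w * (img_size // 16) ** 2
        if extra_dense:  # optional dense layer, part of the hyperparameter search
            body += [spectral_norm(nn.Linear(feat, 128 * w)), nn.LeakyReLU(0.2)]
            feat = 128 * w
        self.body = nn.Sequential(*body)

        aux = [spectral_norm(nn.Linear(label_dim, 128 * w)), nn.LeakyReLU(0.2)]
        if extra_dense:
            aux += [spectral_norm(nn.Linear(128 * w, 128 * w)), nn.LeakyReLU(0.2)]
        self.aux = nn.Sequential(*aux)

        self.head = nn.Sequential(
            spectral_norm(nn.Linear(feat + 128 * w, 128 * w)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(128 * w, 128 * w)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(128 * w, 1)),  # final logit, dense with bias
        )

    def forward(self, x, s_labeled):
        h = torch.cat([self.body(x), self.aux(s_labeled)], dim=1)
        return self.head(h)
```

The match-pairing and rank-pairing discriminators of Tables 4 and 5 reuse the same convolutional body, applied separately to each element of the pair, before their unconditional and conditional heads.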
Table 5: Discriminator used for rank pairing. For rank pairing, we use a special variant of the projection discriminator, where the conditional logit is computed by taking the difference between the two elements of the pair and multiplying by y ∈ {−1, +1}. The discriminator is thus implicitly taking on the role of an adversarially trained encoder that checks for violations of the ranking rule in the embedding space. Parts in red are part of the hyperparameter search.
Discriminator Body (applied separately to x and x′)
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 32 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
4 × 4 spectral norm conv. 64 × width. lReLU
flatten
If extra dense: 128 × width spectral norm dense. lReLU
concatenate the pair.
Unconditional Head (applied separately to x and x′)
1 spectral norm dense with bias.
Conditional Head (applied separately to x and x′)
y-dim spectral norm dense.

For all models, we use the Adam optimizer with β1 = 0.5, β2 = 0.999 and set the generator learning rate to $1 \times 10^{-3}$. We use a batch size of 64 and set the leaky ReLU negative slope to 0.2. To demonstrate some degree of robustness to hyperparameter choices, we considered five different ablations:
1. Width multiplier on the discriminator network ({1, 2}).
2. Whether to add an extra fully-connected layer to the discriminator ({True, False}).
3. Whether to add a bias term to the head ({True, False}).
4. Whether to use a two-time-scale learning rate by setting the encoder + discriminator learning rate multiplier to ({1, 2}).
5. Whether to use the default PyTorch or Keras initialization scheme in all models.
As such, each of our experimental settings trains a total of 32 distinct models. The only exception is the intersection experiments, where we fixed the width multiplier to 1. To give a sense of the scale of our experimental setup, note that the 864 models in Figure 4 originate as follows:
1. 32 hyperparameter conditions × 6 restricted labeling conditions.
2. 32 hyperparameter conditions × 6 match pairing conditions.
3. 32 hyperparameter conditions × 6 share pairing conditions.
4. 32 hyperparameter conditions × 6 rank pairing conditions.
5. 16 hyperparameter conditions × 6 intersection conditions.
In total, this yields 4 × (32 × 6) + 16 × 6 = 768 + 96 = 864 models.

I.1 ASSUMPTIONS ON H

Assumption 1. Let $D \subseteq [n]$ index the discrete random variables $S_D$. Assume that the remaining random variables $S_C = S_{\setminus D}$ have a probability density function $p(s_C \mid s_D)$ for any set of values $s_D$ where $p(S_D = s_D) > 0$.

Assumption 2. Without loss of generality, suppose $S_{1:n} = [S_C, S_D]$ is ordered by concatenating the continuous variables with the discrete variables. Let $B(s_D) = [\mathrm{int}(\mathrm{supp}(p(s_C \mid s_D))), s_D]$ denote the interior of the support of the continuous conditional distribution of $S_C$, concatenated with its conditioning value $s_D$ drawn from $S_D$. With a slight abuse of notation, let $B(S) = \bigcup_{s_D : p(s_D) > 0} B(s_D)$. We assume $B(S)$ is zig-zag connected, i.e., for any $I, J \subseteq [n]$ and any two points $s_{1:n}, s'_{1:n} \in B(S)$ that only differ in coordinates in $I \cup J$, there exists a path $\{s^t_{1:n}\}_{t=0:T}$ contained in $B(S)$ such that
$s^0_{1:n} = s_{1:n}$, (13)
$s^T_{1:n} = s'_{1:n}$, (14)
$\forall\, 0 \le t < T$, either $s^t_{\setminus I} = s^{t+1}_{\setminus I}$ or $s^t_{\setminus J} = s^{t+1}_{\setminus J}$. (15)
Intuitively, this assumption allows transition from $s_{1:n}$ to $s'_{1:n}$ via a series of modifications that are only in $I$ or only in $J$.
Note that zig-zag connectedness is necessary for restrictiveness union (Proposition 3) and consistency intersection (Proposition 4). Figure 22 gives examples where restrictiveness union is not satisfied when zig-zag connectedness is violated.

Assumption 3. For any coordinate $j \in [m]$ of $g$ that maps to a continuous variable $X_j$, we assume that $g_j(s)$ is continuous at $s$ for all $s \in B(S)$; for any coordinate $j \in [m]$ of $g$ that maps to a discrete variable $X_j$ and any $s_D$ where $p(s_D) > 0$, we assume that $g_j(s)$ is constant over each connected component of $\mathrm{int}(\mathrm{supp}(p(s_C \mid s_D)))$. Define $B(X)$ analogously to $B(S)$. Symmetrically, for any coordinate $i \in [n]$ of $e$ that maps to a continuous variable $S_i$, we assume that $e_i(x)$ is continuous at $x$ for all $x \in B(X)$; for any coordinate $i \in [n]$ of $e$ that maps to a discrete $S_i$ and any $x_D$ where $p(x_D) > 0$, we assume that $e_i(x)$ is constant over each connected component of $\mathrm{int}(\mathrm{supp}(p(x_C \mid x_D)))$.

Assumption 4. Assume that every factor of variation is recoverable from the observation X. Formally, $(p, g, e)$ satisfies the following property:
$\mathbb{E}_{p(s_{1:n})} \lVert e \circ g(s_{1:n}) - s_{1:n} \rVert^2 = 0$. (16)

I.2 CALCULUS OF DISENTANGLEMENT

I.2.1 EXPECTED-NORM REDUCTION LEMMA

Lemma 1. Let $x, y$ be two random variables with joint distribution $p$, and let $f(x, y)$ be an arbitrary function. Then
$\mathbb{E}_{x \sim p(x)} \mathbb{E}_{y, y' \sim p(y \mid x)} \lVert f(x, y) - f(x, y') \rVert^2 \le \mathbb{E}_{(x,y), (x',y') \sim p(x,y)} \lVert f(x, y) - f(x', y') \rVert^2$.

Proof. Assume w.l.o.g. that $\mathbb{E}_{(x,y) \sim p(x,y)} f(x, y) = 0$. Then
LHS $= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{x \sim p(x)} \mathbb{E}_{y, y' \sim p(y \mid x)} f(x, y)^\top f(x, y')$ (17)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{x \sim p(x)} \big[\mathbb{E}_{y \sim p(y \mid x)} f(x, y)\big]^\top \big[\mathbb{E}_{y' \sim p(y \mid x)} f(x, y')\big]$ (18)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{x \sim p(x)} \lVert \mathbb{E}_{y \sim p(y \mid x)} f(x, y) \rVert^2$ (19)
$\le 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2$ (20)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\lVert \mathbb{E}_{(x,y) \sim p(x,y)} f(x, y) \rVert^2$ (21)
$= 2\,\mathbb{E}_{(x,y) \sim p(x,y)} \lVert f(x, y) \rVert^2 - 2\,\mathbb{E}_{(x,y), (x',y') \sim p(x,y)} f(x, y)^\top f(x', y')$ (22)
$=$ RHS. (23)

I.2.2 CONSISTENCY UNION

Let $L = I \cap J$, $K = \setminus(I \cup J)$, $M = I \setminus L$, $N = J \setminus L$.

Proposition 2. $C(I) \wedge C(J) \Rightarrow C(I \cup J)$.

Proof. $C(I)$ means that
$\mathbb{E}_{z_M, z_L} \mathbb{E}_{z_N, z'_N, z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z'_N, z'_K) \rVert^2 = 0$. (24)
For any fixed value of $z_M, z_L$,
$\mathbb{E}_{z_N, z'_N, z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z'_N, z'_K) \rVert^2$ (25)
$\ge \mathbb{E}_{z_N} \mathbb{E}_{z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z_N, z'_K) \rVert^2$, (26)
by plugging $x = z_N$, $y = z_K$ into Lemma 1. Therefore
$C(I) \Rightarrow \mathbb{E}_{z_M, z_L, z_N} \mathbb{E}_{z_K, z'_K} \lVert r_I \circ G(z_M, z_L, z_N, z_K) - r_I \circ G(z_M, z_L, z_N, z'_K) \rVert^2 = 0$. (27)
Similarly, we have
$C(J) \Rightarrow \mathbb{E}_{z_M, z_L, z_N} \mathbb{E}_{z_K, z'_K} \lVert r_J \circ G(z_M, z_L, z_N, z_K) - r_J \circ G(z_M, z_L, z_N, z'_K) \rVert^2 = 0$ (28)
$\Rightarrow \mathbb{E}_{z_M, z_L, z_N} \mathbb{E}_{z_K, z'_K} \lVert r_N \circ G(z_M, z_L, z_N, z_K) - r_N \circ G(z_M, z_L, z_N, z'_K) \rVert^2 = 0$. (29)
Since $I \cap N = \emptyset$ and $I \cup N = I \cup J$, adding (27) and (29) gives us $C(I \cup J)$.

I.2.3 RESTRICTIVENESS UNION

Figure 22: Zig-zag connectedness is necessary for restrictiveness union. Here $n = m = 3$. Colored areas indicate the support of $p(z_1, z_2)$; the marked numbers indicate the measurement of $s_3$ given $(z_1, z_2)$. The left two panels satisfy zig-zag connectedness (the paths are marked in gray) while the right two do not (indeed, $R(1) \wedge R(2) \not\Rightarrow R(\{1, 2\})$). In the right-most panel, any zig-zag path connecting two points from the blue and orange areas has to pass through the boundary of the support (disallowed).

Similarly define the index sets $L, K, M, N$.

Proposition 3. Under the assumptions specified in Appendix I.1, $R(I) \wedge R(J) \Rightarrow R(I \cup J)$.

Proof. Denote $f = e_K \circ g$. We claim that
$R(I) \iff \mathbb{E}_{z_{\setminus I}} \mathbb{E}_{z_I, z'_I} \lVert f(z_I, z_{\setminus I}) - f(z'_I, z_{\setminus I}) \rVert^2 = 0$ (30)
$\iff \forall (z_I, z_{\setminus I}), (z'_I, z_{\setminus I}) \in B(Z),\ f(z_I, z_{\setminus I}) = f(z'_I, z_{\setminus I})$. (31)
We first prove the backward direction. When we draw $z_{\setminus I} \sim p(z_{\setminus I})$ and $z_I, z'_I \sim p(z_I \mid z_{\setminus I})$, let $E_1$ denote the event that $(z_I, z_{\setminus I}) \notin B(Z)$, and $E_2$ the event that $(z'_I, z_{\setminus I}) \notin B(Z)$. Reorder the indices of $(z_I, z_{\setminus I})$ as $(z_C, z_D)$. The probability that $(z_I, z_{\setminus I}) \notin B(Z)$ (i.e., that $z_C$ lies on the boundary of $B(z_D)$) is 0. Therefore $\Pr[E_1] = \Pr[E_2] = 0$, and so $\Pr[E_1 \cup E_2] \le \Pr[E_1] + \Pr[E_2] = 0$; i.e., with probability 1, $\lVert f(z_I, z_{\setminus I}) - f(z'_I, z_{\setminus I}) \rVert^2 = 0$.

Now we prove the forward direction. Assume for the sake of contradiction that there exist $(z_I, z_{\setminus I}), (z'_I, z_{\setminus I}) \in B(Z)$ such that $f(z_I, z_{\setminus I}) < f(z'_I, z_{\setminus I})$. Denote $U = I \cap D$, $V = I \cap C$, $W = \setminus I \cap D$, $Q = \setminus I \cap C$. We have $f(z_U, z_V, z_W, z_Q) < f(z'_U, z'_V, z_W, z_Q)$. Since $f$ is continuous (or constant) at $(z_U, z_V, z_W, z_Q)$ in the interior of $B([z_U, z_W])$, and $f$ is also continuous (or constant) at $(z'_U, z'_V, z_W, z_Q)$ in the interior of $B([z'_U, z_W])$, we can draw open balls of radius $r > 0$ around each point, i.e., $B_r(z_V, z_Q) \subseteq B([z_U, z_W])$ and $B_r(z'_V, z_Q) \subseteq B([z'_U, z_W])$, where for all $(\hat z_V, \hat z_Q) \in B_r(z_V, z_Q)$ and $(\hat z'_V, \hat z'_Q) \in B_r(z'_V, z_Q)$,
$f(z_U, \hat z_V, z_W, \hat z_Q) < f(z'_U, \hat z'_V, z_W, \hat z'_Q)$. (32)
When we draw $z_{\setminus I} \sim p(z_{\setminus I})$ and $z_I, z'_I \sim p(z_I \mid z_{\setminus I})$, let $C$ denote the event that $(z_I, z_{\setminus I}) = (\hat z_V, z_U, z^{\#}_Q, z_W)$ and $(z'_I, z_{\setminus I}) = (\hat z'_V, z'_U, z^{\#}_Q, z_W)$, where $(\hat z_V, z^{\#}_Q) \in B_r(z_V, z_Q)$ and $(\hat z'_V, z^{\#}_Q) \in B_r(z'_V, z_Q)$. Since both balls have positive volume, $\Pr[C] > 0$. However, $\lVert f(z_I, z_{\setminus I}) - f(z'_I, z_{\setminus I}) \rVert^2 > 0$ whenever event $C$ happens, which contradicts $R(I)$. Therefore, $\forall (z_I, z_{\setminus I}), (z'_I, z_{\setminus I}) \in B(Z)$, $f(z_I, z_{\setminus I}) = f(z'_I, z_{\setminus I})$.

We have shown that
$R(I) \iff \forall (z_M, z_L, z_N, z_K), (z'_M, z'_L, z_N, z_K) \in B(Z),\ f(z_M, z_L, z_N, z_K) = f(z'_M, z'_L, z_N, z_K)$, (33)
$R(J) \iff \forall (z_M, z_L, z_N, z_K), (z_M, z'_L, z'_N, z_K) \in B(Z),\ f(z_M, z_L, z_N, z_K) = f(z_M, z'_L, z'_N, z_K)$, (34)
$R(I \cup J) \iff \forall (z_M, z_L, z_N, z_K), (z'_M, z'_L, z'_N, z_K) \in B(Z),\ f(z_M, z_L, z_N, z_K) = f(z'_M, z'_L, z'_N, z_K)$. (35)
Let the zig-zag path between $(z_M, z_L, z_N, z_K)$ and $(z'_M, z'_L, z'_N, z_K) \in B(Z)$ be $\{(z^t_M, z^t_L, z^t_N, z_K)\}_{t=0:T}$. Repeatedly applying the equivalent conditions of $R(I)$ and $R(J)$ gives us
$f(z_M, z_L, z_N, z_K) = f(z^1_M, z^1_L, z^1_N, z_K) = \cdots = f(z^{T-1}_M, z^{T-1}_L, z^{T-1}_N, z_K) = f(z'_M, z'_L, z'_N, z_K)$. (36)

I.3 CONSISTENCY AND RESTRICTIVENESS INTERSECTION

Proposition 4. Under the same assumptions as restrictiveness union, $C(I) \wedge C(J) \Rightarrow C(I \cap J)$.

Proof.
$C(I) \wedge C(J) \iff R(\setminus I) \wedge R(\setminus J)$ (37)
$\Rightarrow R(\setminus I \cup \setminus J)$ (38)
$\iff C(\setminus(\setminus I \cup \setminus J))$ (39)
$= C(I \cap J)$. (40)

Proposition 5. $R(I) \wedge R(J) \Rightarrow R(I \cap J)$. The proof is analogous to that of Proposition 4.

I.4 DISTRIBUTION MATCHING GUARANTEES LATENT CODE INFORMATIVENESS

Proposition 6. If $(p^*, g^*, e^*) \in H$, $(p, g, e) \in H$, and $g^*(S) \overset{d}{=} g(Z)$, then there exists a continuous function $r$ such that
$\mathbb{E}_{p^*(s_{1:n})} \lVert r \circ e \circ g^*(s) - s \rVert = 0$. (41)

Proof. We show that $r = e^* \circ g$ satisfies Proposition 6. By Assumption 4,
$\mathbb{E}_s \lVert e^* \circ g^*(s) - s \rVert^2 = 0$, (42)
$\mathbb{E}_z \lVert e \circ g(z) - z \rVert^2 = 0$. (43)
By the same reasoning as in the proof of Proposition 3,
$\mathbb{E}_s \lVert e^* \circ g^*(s) - s \rVert^2 = 0 \Rightarrow \forall s \in B(S),\ e^* \circ g^*(s) = s$, (44)
$\mathbb{E}_z \lVert e \circ g(z) - z \rVert^2 = 0 \Rightarrow \forall z \in B(Z),\ e \circ g(z) = z$. (45)
Let $s \sim p^*(s)$. We claim that $\Pr[E_1] = 1$, where $E_1$ denotes the event that there exists $z \in B(Z)$ such that $g^*(s) = g(z)$. Suppose to the contrary that there is a measure-non-zero set $\bar S \subseteq \mathrm{supp}(p^*(s))$ such that for all $s \in \bar S$, no $z \in B(Z)$ satisfies $g^*(s) = g(z)$. Let $\bar X = \{g^*(s) : s \in \bar S\}$. As $g^*(S) \overset{d}{=} g(Z)$, $\Pr_s[g^*(s) \in \bar X] = \Pr_z[g(z) \in \bar X] > 0$. Therefore, there exists $\bar Z \subseteq \mathrm{supp}(p(z)) \setminus B(Z)$ such that $\bar X \subseteq \{g(z) : z \in \bar Z\}$. But $\mathrm{supp}(p(z)) \setminus B(Z)$ has measure 0. Contradiction. When we draw $s$, let $E_2$ denote the event that $s \in B(S)$. $\Pr[E_2] = 1$, so $\Pr[E_1 \cap E_2] = 1$.
When $E_1 \cap E_2$ happens, $e^* \circ g \circ e \circ g^*(s) = e^* \circ g \circ e \circ g(z) = e^* \circ g(z) = e^* \circ g^*(s) = s$. Therefore,
$\mathbb{E}_s \lVert e^* \circ g \circ e \circ g^*(s) - s \rVert = 0$. (46)

I.5 WEAK SUPERVISION GUARANTEE

Theorem 1. Given any oracle $(p^*(s), g^*, e^*) \in H$, consider the distribution-matching algorithm $A$ that selects a model $(p(z), g, e) \in H$ such that:
1. $(g^*(S), S_I) \overset{d}{=} (g(Z), Z_I)$ (Restricted Labeling); or
2. $\big(g^*(S_I, S_{\setminus I}), g^*(S_I, S'_{\setminus I})\big) \overset{d}{=} \big(g(Z_I, Z_{\setminus I}), g(Z_I, Z'_{\setminus I})\big)$ (Match Pairing); or
3. $\big(g^*(S), g^*(S'), \mathbb{1}\{S_I \le S'_I\}\big) \overset{d}{=} \big(g(Z), g(Z'), \mathbb{1}\{Z_I \le Z'_I\}\big)$ (Rank Pairing).
Then $(p, g)$ satisfies $C(I; p, g, e^*)$ and $e$ satisfies $C(I; p^*, g^*, e)$.

Proof. We prove the three cases separately.
1. Since $(x_d, s_I) \overset{d}{=} (x_g, z_I)$, consider the measurable function
$f(a, b) = \lVert e^*_I(a) - b \rVert^2$. (47)
Then
$\mathbb{E} \lVert e^*_I(x_d) - s_I \rVert^2 = \mathbb{E} \lVert e^*_I(x_g) - z_I \rVert^2 = 0$. (48)
By the same reasoning as in the proof of Proposition 3,
$\mathbb{E}_z \lVert e^*_I \circ g(z) - z_I \rVert^2 = 0 \Rightarrow \forall z \in B(Z),\ e^*_I \circ g(z) = z_I$ (49)
$\Rightarrow \mathbb{E}_{z_I} \mathbb{E}_{z_{\setminus I}, z'_{\setminus I}} \lVert e^*_I \circ g(z_I, z_{\setminus I}) - e^*_I \circ g(z_I, z'_{\setminus I}) \rVert^2 = 0$, (50)
i.e., $g$ satisfies $C(I; p, g, e^*)$. By symmetry, $e$ satisfies $C(I; p^*, g^*, e)$.
2. We have
$\big(g^*(S_I, S_{\setminus I}), g^*(S_I, S'_{\setminus I})\big) \overset{d}{=} \big(g(Z_I, Z_{\setminus I}), g(Z_I, Z'_{\setminus I})\big)$ (51)
$\Rightarrow \lVert e^*_I \circ g^*(S_I, S_{\setminus I}) - e^*_I \circ g^*(S_I, S'_{\setminus I}) \rVert^2 \overset{d}{=} \lVert e^*_I \circ g(Z_I, Z_{\setminus I}) - e^*_I \circ g(Z_I, Z'_{\setminus I}) \rVert^2$ (52)
$\Rightarrow \mathbb{E}_{z_I} \mathbb{E}_{z_{\setminus I}, z'_{\setminus I}} \lVert e^*_I \circ g(z_I, z_{\setminus I}) - e^*_I \circ g(z_I, z'_{\setminus I}) \rVert^2 = 0$. (53)
So $g$ satisfies $C(I; p, g, e^*)$. By symmetry, $e$ satisfies $C(I; p^*, g^*, e)$.
3. Let $I = \{i\}$ and $f = e^*_I \circ g$. Distribution matching implies that, with probability 1 over random draws of $Z, Z'$, the following event $P$ happens:
$Z_I \le Z'_I \Rightarrow f(Z) \le f(Z')$, (54)
i.e.,
$\mathbb{E}_{z, z'}\, \mathbb{1}[\neg P] = 0$. (55)
Let $W = \setminus I \cap D$, $Q = \setminus I \setminus D$. We showed in the proof of Proposition 3 that
$C(I) \iff \forall (z_I, z_W, z_Q), (z_I, z'_W, z'_Q) \in B(Z),\ f(z_I, z_W, z_Q) = f(z_I, z'_W, z'_Q)$. (56)
We prove by contradiction. Suppose there exist $(z_I, z_W, z_Q), (z_I, z'_W, z'_Q) \in B(Z)$ such that $f(z_I, z_W, z_Q) < f(z_I, z'_W, z'_Q)$.
(a) Case 1: $Z_I$ is discrete. Since $f$ is constant both at $(z_I, z_W, z_Q)$ in the interior of $B([z_I, z_W])$ and at $(z_I, z'_W, z'_Q)$ in the interior of $B([z_I, z'_W])$, we can draw open balls of radius $r > 0$ around each point, i.e., $B_r(z_Q) \subseteq B([z_I, z_W])$ and $B_r(z'_Q) \subseteq B([z_I, z'_W])$, where for all $\hat z_Q \in B_r(z_Q)$ and $\hat z'_Q \in B_r(z'_Q)$,
$f(z_I, z_W, \hat z_Q) < f(z_I, z'_W, \hat z'_Q)$. (57)
When we draw $z, z' \sim p(z)$, let $C$ denote the event that this specific value of $z_I$ is picked for both $z$ and $z'$, that $z_{\setminus I}$ falls in $B_r(z'_Q)$ with discrete part $z'_W$, and that $z'_{\setminus I}$ falls in $B_r(z_Q)$ with discrete part $z_W$. Since both balls have positive volume, $\Pr[C] > 0$. However, $P$ does not happen whenever event $C$ happens, since $z_I = z'_I$ but $f(z) > f(z')$, which contradicts $\Pr[P] = 1$.
(b) Case 2: $Z_I$ is continuous. Similar to Case 1, we can draw open balls of radius $r > 0$ around each point, i.e., $B_r(z_I, z_Q) \subseteq B(z_W)$ and $B_r(z_I, z'_Q) \subseteq B(z'_W)$, where for all $(\hat z_I, \hat z_Q) \in B_r(z_I, z_Q)$ and $(\hat z'_I, \hat z'_Q) \in B_r(z_I, z'_Q)$,
$f(\hat z_I, z_W, \hat z_Q) < f(\hat z'_I, z'_W, \hat z'_Q)$. (58)
Let $H_1 = \{(\hat z_I, \hat z_Q) \in B_r(z_I, z_Q) : \hat z_I \ge z_I\}$ and $H_2 = \{(\hat z'_I, \hat z'_Q) \in B_r(z_I, z'_Q) : \hat z'_I \le z_I\}$. When we draw $z, z' \sim p(z)$, let $C$ denote the event that we pick $z \in H_2 \times \{z'_W\}$ and $z' \in H_1 \times \{z_W\}$. Since $H_1, H_2$ have positive volume, $\Pr[C] > 0$. However, $P$ does not happen whenever event $C$ happens, since $z_I \le z'_I$ but $f(z) > f(z')$, which contradicts $\Pr[P] = 1$.
Therefore, we have shown that
$\forall (z_I, z_W, z_Q), (z_I, z'_W, z'_Q) \in B(Z),\ f(z_I, z_W, z_Q) = f(z_I, z'_W, z'_Q)$, (59)
i.e., $g$ satisfies $C(I; p, g, e^*)$. By symmetry, $e$ satisfies $C(I; p^*, g^*, e)$.

I.6 WEAK SUPERVISION IMPOSSIBILITY RESULT

Theorem 2. Weak supervision via restricted labeling, match pairing, or ranking on $s_I$ is not sufficient for learning a generative model whose latent code $Z_I$ is restricted to $S_I$.

Proof. We construct the following counterexample. Let $n = m = 3$ and $I = \{1\}$.
The data-generating process is $s_1 \sim \mathrm{unif}([0, 2\pi))$, $(s_2, s_3) \sim \mathrm{unif}(\{(x, y) : x^2 + y^2 \le 1\})$, with $g^*(s) = [s_1, s_2, s_3]$. Consider a generator with $z_1 \sim \mathrm{unif}([0, 2\pi))$, $(z_2, z_3) \sim \mathrm{unif}(\{(x, y) : x^2 + y^2 \le 1\})$, and $g(z) = [z_1,\ \cos(z_1) z_2 - \sin(z_1) z_3,\ \sin(z_1) z_2 + \cos(z_1) z_3]$. Then $(x_d, s_I) \overset{d}{=} (x_g, z_I)$, but neither $R(I; p, g, e^*)$ nor $R(I; p^*, g^*, e)$ holds. The same counterexample is applicable for match pairing and rank pairing.
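To make the counterexample tangible, the following numerical sketch (ours, in numpy; the distribution check is a crude moment comparison rather than a formal test) samples from both processes, confirms that the observed distributions agree (the restricted label equals the first coordinate of x in both models, so matching the x distribution matches the joint), and then shows that intervening on z_1 alone moves the second and third output coordinates of g, so z_1 is not restricted to s_1.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def sample_disk(n):
    # Uniform samples on the unit disk via rejection sampling from the square.
    pts = rng.uniform(-1, 1, size=(2 * n, 2))
    pts = pts[(pts ** 2).sum(axis=1) <= 1]
    return pts[:n]

def g_star(s1, s23):
    # Oracle generator: identity map on the factors.
    return np.column_stack([s1, s23])

def g(z1, z23):
    # Alternative generator: rotate (z2, z3) by the angle z1.
    c, s = np.cos(z1), np.sin(z1)
    return np.column_stack([z1, c * z23[:, 0] - s * z23[:, 1],
                                s * z23[:, 0] + c * z23[:, 1]])

s1 = rng.uniform(0, 2 * np.pi, N)
z1 = rng.uniform(0, 2 * np.pi, N)
x_d = g_star(s1, sample_disk(N))
x_g = g(z1, sample_disk(N))

# Restricted labeling on factor 1: (x_d, s_1) and (x_g, z_1) agree in distribution,
# because rotating the rotationally symmetric disk distribution leaves it unchanged.
# Crude check: first and second moments match up to Monte Carlo error.
print(np.abs(x_d.mean(0) - x_g.mean(0)).max())       # small
print(np.abs(np.cov(x_d.T) - np.cov(x_g.T)).max())   # small

# But z_1 is not restricted to s_1: changing z_1 with (z_2, z_3) held fixed
# also changes coordinates 2 and 3 of the output, so R({1}) fails for g.
z23 = np.array([[0.5, 0.25]])
print(g(np.array([0.0]), z23)[0, 1:])         # [0.5, 0.25]
print(g(np.array([np.pi / 2]), z23)[0, 1:])   # approximately [-0.25, 0.5]
```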