Published as a conference paper at ICLR 2022

DISENTANGLEMENT ANALYSIS WITH PARTIAL INFORMATION DECOMPOSITION

Seiya Tokui, The University of Tokyo, tokui@g.ecc.u-tokyo.ac.jp
Issei Sato, The University of Tokyo, sato@g.ecc.u-tokyo.ac.jp

ABSTRACT

We propose a framework to analyze how multivariate representations disentangle ground-truth generative factors. Quantitative analysis of disentanglement has so far been based on metrics designed to compare how one variable explains each generative factor. Current metrics, however, may fail to detect entanglement that involves more than two variables, e.g., representations that duplicate and rotate generative factors in high-dimensional spaces. In this work, we establish a framework to analyze information sharing in a multivariate representation with Partial Information Decomposition and propose a new disentanglement metric. This framework enables us to understand disentanglement in terms of uniqueness, redundancy, and synergy. We develop an experimental protocol to assess how increasingly entangled representations are evaluated with each metric and confirm that the proposed metric correctly responds to entanglement. Through experiments on variational autoencoders, we find that models with similar disentanglement scores have a variety of characteristics in entanglement, for each of which a distinct strategy may be required to obtain a disentangled representation.

1 INTRODUCTION

Disentanglement is a guiding principle for designing a learned representation separable into parts that individually capture the underlying factors of variation. The concept was originally conceived as an inductive bias toward obtaining representations aligned with the underlying factors of variation in data (Bengio et al., 2013) and has been applied to controlling otherwise unstructured representations of data from several domains, e.g., images (Karras et al., 2019; Esser et al., 2019), text (Hu et al., 2017), and audio (Hsu et al., 2019), to name just a few. While the concept is appealing, how to define disentanglement remains unclear. After Higgins et al. (2017), generative learning methods with regularized total correlation have been proposed (Kim & Mnih, 2018; Chen et al., 2018); however, it is still not clear whether independence of latent variables is essential for better disentanglement (Higgins et al., 2018). Furthermore, it is not obvious how to measure disentanglement even when the true generative factors are given. Toward understanding disentanglement, it is crucial to define disentanglement metrics, for which several attempts have been made (Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018; Eastwood & Williams, 2018; Do & Tran, 2020; Zaidi et al., 2020); however, there are still problems to be solved.

Current disentanglement metrics may fail to detect entanglement involving more than two variables. In these metrics, one first measures how each variable explains one generative factor and then compares or contrasts the results among variables. With such a procedure, we may overlook multiple variables jointly conveying information about one generative factor. For example, let z = (z1, z2) be a representation consisting of two vectors, where each variable in z1 disentangles a distinct generative factor and z2 is a rotation of z1 that is not axis-aligned with the factors. Since any single dimension of z2 alone may convey little information about one generative factor, these metrics do not detect that multiple variables encode one generative factor redundantly.
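As a concrete, purely illustrative sketch of this failure mode (ours, not part of the original paper), the following snippet builds the rotated-duplicate example with two Gaussian factors and computes the relevant mutual information terms analytically from covariance submatrices; each coordinate of z2 alone carries only a fraction of a factor's information, while z2 as a whole carries everything that z1 does.

```python
# Sketch: a rotated copy z2 of a disentangled code z1 hides its information from
# per-dimension metrics. All variables are jointly Gaussian, so mutual information
# follows from log-determinants of covariance submatrices. (Illustrative values only.)
import numpy as np

sigma = 0.1                                           # observation noise of the code
R = np.array([[np.cos(np.pi/4), -np.sin(np.pi/4)],
              [np.sin(np.pi/4),  np.cos(np.pi/4)]])   # 45-degree rotation

# Joint covariance of (y, z1, z2): y ~ N(0, I2), z1 = y + sigma*eps, z2 = R z1.
C_y  = np.eye(2)
C_z1 = (1 + sigma**2) * np.eye(2)
cov = np.block([
    [C_y,      C_y,      C_y @ R.T],
    [C_y,      C_z1,     C_z1 @ R.T],
    [R @ C_y,  R @ C_z1, R @ C_z1 @ R.T],
])

def mi(a, b):
    """I(x_a; x_b) for the jointly Gaussian vector with covariance `cov`."""
    ld = lambda idx: np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (ld(a) + ld(b) - ld(a + b))

print("I(y1; z2_1) =", mi([0], [4]))      # one coordinate of z2 alone: small
print("I(y1; z2_2) =", mi([0], [5]))
print("I(y1; z2)   =", mi([0], [4, 5]))   # z2 jointly: equals I(y1; z1_1)
print("I(y1; z1_1) =", mi([0], [2]))
```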
Although this is a simple example, this kind of information sharing may arise in learned representations as well, even when the variables are not linearly correlated.

In this work, we present a disentanglement analysis framework aware of interactions among multiple variables. Our key idea is to decompose the information of a representation into entangled and disentangled components using Partial Information Decomposition (PID), a framework in modern information theory for analyzing information sharing among multiple random variables (Williams & Beer, 2010).

Figure 1: Information diagram of a three-variable system in PID. Each circle represents mutual information, and each area separated by them represents a decomposed term in PID (U: unique information, R: redundant information, C: complementary information). When we substitute a generative factor for u, a latent variable for v1, and the other latent variables for v2, the unique information U(u; v1 \ v2) represents the information of the factor disentangled by the latent variable. See Figure 5 in the Appendix for an alternative form similar to the ones we will use in Section 3.

As illustrated in Figure 1, the mutual information I(u; v1, v2) = E[log p(u, v1, v2) / (p(u) p(v1, v2))] between a random variable u and a pair of random variables (v1, v2) is decomposed into four nonnegative terms: unique information U(u; v1 \ v2) and U(u; v2 \ v1) (see footnote 1), redundant information R(u; v1, v2), and complementary (or synergistic) information C(u; v1, v2). While these partial information terms have no agreed-upon concrete definitions yet (Bertschinger et al., 2014; Finn & Lizier, 2018; Lizier et al., 2018; Finn & Lizier, 2020; Sigtermans, 2020), we can derive universal lower and upper bounds of the partial information terms using only well-defined mutual information terms.

We apply the PID framework to representations learned from data by letting u be a generative factor and v1, v2 be one latent variable and the remaining latent variables, respectively. The unique information of a latent variable intuitively corresponds to the amount of disentangled information, while the redundant and complementary information correspond to different types of entangled information. We can quantify disentanglement and multiple types of entanglement through the framework, which enriches our understanding of disentangled representations. Our contributions are summarized as follows.

PID-based disentanglement analysis framework: We propose a disentanglement analysis framework that captures interactions among multiple variables with PID. With this framework, one can distinguish two different types of entanglement, namely redundancy and synergy, which provide insights into how a representation entangles generative factors.

Tractable bounds of partial information terms: We derive lower and upper bounds of partial information terms. We formulate a disentanglement metric, called UNIBOUND, using the lower bound of unique information. We design entanglement attacks, which inject entanglement into a given disentangled representation, and confirm through experiments using them that UNIBOUND effectively captures entanglement involving multiple variables.

Detailed analyses of learned representations: We analyze representations obtained by variational autoencoders (VAEs).
We observe that UNIBOUND sometimes disagrees with other metrics, which indicates that multi-variable interactions may dominate learned representations. We also observe that different types of entanglement arise in models learned with different methods. This observation suggests that distinct approaches may be required to remove them for disentangled representation learning.

PROBLEM FORMULATION AND NOTATIONS

Let x be a random variable representing a data point, drawn uniformly from a dataset D. Assume that the true generative factors y = (y1, . . . , yK) are available for each data point; in other words, we can access the subset D(y) ⊆ D of the data points with any fixed generative factors y. Let z = (z1, . . . , zL) be a latent representation consisting of L random variables. An inference model is provided as the conditional distribution p(z|x). Our goal is to evaluate how well the latent variables z disentangle each generative factor in y. The inference model can integrate out the input as p(·|y) = E_{p(x|y)}[p(·|x)] = (1/|D(y)|) Σ_{x∈D(y)} p(·|x); therefore, we only use y and z in most of our discussions.

Footnote 1: Note that this \ is not a set-difference operator. It is a common notation in the PID literature, used to emphasize that unique information is not symmetric and resembles the set difference depicted in Fig. 1.

We denote the mutual information between random variables u and v by I(u; v) = E[log p(u, v) / (p(u) p(v))], the entropy of u by H(u) = −E[log p(u)], and the conditional entropy of u given v by H(u|v) = −E[log p(u|v)]. We denote a vector of zeros by 0, a vector of ones by 1, and an identity matrix by I. We denote by N(µ, Σ) the Gaussian distribution with mean µ and covariance Σ and denote its density function by N(·; µ, Σ).

2 RELATED WORK

The importance of representing data with multiple variables conveying distinct information has been recognized at least since the 80s (Barlow, 1989; Barlow et al., 1989; Schmidhuber, 1992). The minimum entropy coding principle (Watanabe, 1981), which aims at representing data by random variables z with the minimum possible sum of marginal entropies Σℓ H(zℓ), was found to be useful for unsupervised learning to remove the inherent redundancy in sensory stimuli. The resulting representation minimizes the total correlation and is called factorial coding. Recent advances in disentangled representation learning based on VAEs (Kingma & Welling, 2014) are guided by the same principle as minimum entropy coding (Kim & Mnih, 2018; Chen et al., 2018; Gao et al., 2019).

Understanding better representations, which is tackled from the coding side as above, is also approached from the generative perspective. It is often expected that data are generated from generative factors through a process that entangles them into a high-dimensional sensory space (DiCarlo & Cox, 2007). As generative factors are useful as the basis of downstream learning tasks, obtaining disentangled representations from data is a hot topic of representation learning (Bengio et al., 2013).

Toward learning disentangled representations, it is arguably important to quantitatively measure disentanglement. In that regard, Higgins et al. (2017) established a standard evaluation procedure using controlled datasets with balanced and fully-annotated ground-truth factors. A variety of metrics have since been proposed on the basis of this procedure.
Among them, Higgins et al. (2017) and Kim & Mnih (2018) propose metrics based on the deviation of each latent variable conditioned on a generative factor. In contrast, Mutual Information Gap (MIG) (Chen et al., 2018) and its variants (Do & Tran, 2020; Zaidi et al., 2020) are based on mutual information between a latent variable and a generative factor. We extend the latter direction by considering multi-variable interactions.

Barlow (1989) discussed redundancy by comparing the population and the individual variables through their entropies, i.e., total correlation. It is, however, less trivial to measure redundancy as an information quantity. The PID framework (Williams & Beer, 2010) provides an approach to understanding redundancy among multiple random variables as a constituent of mutual information. The framework provides some desirable relationships between decomposed information terms, while it leaves some degrees of freedom in determining all of them, for which several definitions have been proposed (Williams & Beer, 2010; Bertschinger et al., 2014; Finn & Lizier, 2018; 2020; Sigtermans, 2020).

The PID framework has been applied to machine learning models. For example, Tax et al. (2017) measured the PID terms for restricted Boltzmann machines using the definition of Williams & Beer (2010). Yu et al. (2021) took an alternative route, similar to our approach, where they measured linear combinations of PID terms through corresponding linear combinations of mutual information terms. These works aim at analyzing the learning dynamics of models in supervised settings. In contrast, we use the PID framework for analyzing disentanglement in unsupervised representation learning.

3 PARTIAL INFORMATION DECOMPOSITION FOR DISENTANGLEMENT

In this section, we analyze the current metrics and introduce our framework. In Section 3.1, we introduce the PID of the system we consider. In Section 3.2, we investigate the current metrics in terms of multi-variable interactions. In Sections 3.3 and 3.4, we construct our disentanglement metric with bounds for PID terms. We provide a method of computing the bounds in Section 3.5.

3.1 PARTIAL INFORMATION DECOMPOSITION

We tackle the problem of evaluating disentanglement of a latent representation z relative to the true generative factors y from an information-theoretic perspective. Let us consider evaluating how one generative factor yk is captured by the latent representation z. The information of yk captured by z is measured using the mutual information I(yk; z) = H(z) − H(z|yk).

In a desirably disentangled representation, we expect one of the latent variables zℓ to exclusively capture the information of the factor yk. To evaluate a given representation, we are interested in understanding how the information is distributed between a latent variable zℓ and the remaining representation z\ℓ = (zℓ′)ℓ′≠ℓ. This is best described by the PID framework, where the mutual information is decomposed into the following four terms.

I(yk; z) = R(yk; zℓ, z\ℓ) + U(yk; zℓ \ z\ℓ) + U(yk; z\ℓ \ zℓ) + C(yk; zℓ, z\ℓ).   (1)

Here, the decomposed terms represent the following non-negative quantities. Redundant information R(yk; zℓ, z\ℓ) is the information of yk held by both zℓ and z\ℓ. Unique information U(yk; zℓ \ z\ℓ) is the information of yk held by zℓ and not held by z\ℓ. The opposite term U(yk; z\ℓ \ zℓ) is defined by exchanging the roles of zℓ and z\ℓ.
Complementary information (or synergistic information) C(yk; zℓ, z\ℓ) is the information of yk held by z = (zℓ, z\ℓ) that is not held by either zℓ or z\ℓ alone.

The following identities, combined with Eq. 1, partially characterize each term.

I(yk; zℓ) = R(yk; zℓ, z\ℓ) + U(yk; zℓ \ z\ℓ),   I(yk; z\ℓ) = R(yk; zℓ, z\ℓ) + U(yk; z\ℓ \ zℓ).   (2)

The decomposition of this system is illustrated in Figure 2a. We expect disentangled representations to concentrate the information of yk in a single latent variable zℓ and to let the other variables z\ℓ convey none of that information, whether uniquely, redundantly, or synergistically. In terms of PID, this means maximizing the unique information U(yk; zℓ \ z\ℓ) while minimizing the other parts of the decomposition.

The above formulation is incomplete, as one degree of freedom remains in determining the four terms from the three equalities. Instead of searching for suitable definitions of these terms, we build discussions applicable to any definitions that fulfill the above incomplete requirements (see footnote 2).

3.2 UNDERSTANDING DISENTANGLEMENT METRICS FROM AN INTERACTION PERSPECTIVE

Current disentanglement metrics are typically designed to measure how each latent variable captures a factor and to compare the result among latent variables, i.e., to output a high score when only one latent variable captures the factor well. For example, the BetaVAE metric (Higgins et al., 2017) is computed by estimating the mean absolute difference (MAD) of two i.i.d. variables following p(zℓ|yk) = E_{p(x|yk)}[p(zℓ|x)] for each ℓ by Monte Carlo sampling and training a linear classifier that predicts k from the noisy estimates of the differences. The FactorVAE metric (Kim & Mnih, 2018) is computed similarly, except that the MAD is replaced with variance (following normalization by the population), and a majority-vote classifier is used to eliminate a failure mode of ignoring one of the factors and to avoid depending on hyperparameters. These metrics have the same goal of finding a mapping between ℓ and k by comparing the deviation of zℓ when fixing yk. Since the deviation is computed for each zℓ separately, these metrics do not account for how each latent variable zℓ interacts with the other variables z\ℓ.

Another example is MIG (Chen et al., 2018), which compares the mutual information I(yk; zℓ) for all ℓ and uses the gap between the maximum and the second maximum among them. More precisely, MIG is defined by the following formula.

MIG = (1/K) Σ_k (1/H(yk)) max_ℓ min_{ℓ′≠ℓ} (I(yk; zℓ) − I(yk; zℓ′)).   (3)

Here, dividing each summand by H(yk) balances the contribution of each factor when the factors are discrete. The difference in mutual information can be rewritten as a difference in unique information as

I(yk; zℓ) − I(yk; zℓ′) = U(yk; zℓ \ zℓ′) − U(yk; zℓ′ \ zℓ) ≤ U(yk; zℓ \ zℓ′).   (4)

In that sense, MIG effectively captures the pairwise interactions between latent variables. This metric still ignores interplay among more than two variables. Figure 2b reveals that some of the redundant information R(yk; zℓ, z\ℓ) is positively evaluated in MIG, whereas it should have been considered a signal of entanglement. Note that there are several extensions to MIG (Do & Tran, 2020; Li et al., 2020; Zaidi et al., 2020); see Appendix B for detailed discussions on them.

Footnote 2: Most studies on PID only deal with discrete systems, while deep representations often include continuous variables. There have been attempts to define and analyze PID for continuous systems (Barrett, 2015; Schick-Poland et al., 2021; Pakman et al., 2021).
Note that the domain of variables on which our framework depends is not limited by the PID framework.

Figure 2: Information diagrams depicted by bands (the style borrowed from Figure 8.1 of MacKay (2003)). White boxes represent mutual information, which we can compute. (a) PID for systems with yk, zℓ, and z\ℓ: the bands depict the decomposition used in the PID-based disentanglement evaluation. (b) Side-by-side comparison of positive and negative terms in UNIBOUND and MIG: this diagram superposes the decompositions for systems with (yk, zℓ, z\ℓ) and (yk, zℓ, zℓ′), where zℓ′ is the latent variable chosen by the MIG evaluation. The green boxes are positively evaluated in MIG (the top colored line) and UNIBOUND (the bottom colored line), while the red boxes are negatively evaluated in them. Observe that MIG positively evaluates a part of the redundancy, namely R(yk; zℓ, z\ℓ) − R(yk; zℓ, zℓ′), as it does not take into account the interactions among strict supersets of {zℓ, zℓ′}.

3.3 UNIBOUND: NOVEL DISENTANGLEMENT METRIC

We can lower bound the unique information under any possible PID definition by computable components, as we did in Eq. 4. To bound U(yk; zℓ \ z\ℓ) instead of U(yk; zℓ \ zℓ′), we replace zℓ′ with z\ℓ, obtaining

U(yk; zℓ \ z\ℓ) ≥ [U(yk; zℓ \ z\ℓ) − U(yk; z\ℓ \ zℓ)]_+ = [I(yk; zℓ) − I(yk; z\ℓ)]_+,   (5)

where we use [·]_+ = max{0, ·}, as the difference in mutual information is not guaranteed to be nonnegative. The decomposed terms evaluated by this bound are illustrated in the lower part of Figure 2b. It effectively excludes, from the positive term, the effect of interaction between zℓ and any other latent variables. In a similar way to MIG, we summarize this bound over all the generative factors to obtain the metric we call UNIBOUND.

UNIBOUND := (1/K) Σ_k (1/H(yk)) max_ℓ [I(yk; zℓ) − I(yk; z\ℓ)]_+.   (6)

Dividing each summand by the entropy H(yk) has the same role as in MIG; it makes the evaluation fair between factors and eases comparison, as the metric is normalized when yk is discrete.

3.4 OTHER BOUNDS FOR PARTIAL INFORMATION TERMS

The UNIBOUND metric is a handy quantity for comparing representations by a single scalar, while PID itself may provide more ideas on how a given representation entangles or disentangles the factors. To fully leverage this potential, we derive bounds for all the terms of interest, including the redundancy and synergy terms, from both the lower and upper sides. Let II(yk; zℓ; z\ℓ) = I(yk; zℓ) + I(yk; z\ℓ) − I(yk; z) be the interaction information of the triple (yk, zℓ, z\ℓ). Using the nonnegativity of the PID terms, we can derive the following bounds from Eqs. 1-2.

[I(yk; zℓ) − I(yk; z\ℓ)]_+ ≤ U(yk; zℓ \ z\ℓ) ≤ I(yk; zℓ) − [II(yk; zℓ; z\ℓ)]_+,
[II(yk; zℓ; z\ℓ)]_+ ≤ R(yk; zℓ, z\ℓ) ≤ min{I(yk; zℓ), I(yk; z\ℓ)},
[−II(yk; zℓ; z\ℓ)]_+ ≤ C(yk; zℓ, z\ℓ) ≤ min{I(yk; zℓ), I(yk; z\ℓ)} − II(yk; zℓ; z\ℓ).   (7)

Note that all six bounds are computed by arithmetic on I(yk; zℓ), I(yk; z\ℓ), and I(yk; z). We can summarize each lower bound in a similar way as in Eq. 6 and summarize the corresponding upper bound using, for each k, the same ℓ as the lower bound. While these bounds only determine the terms as intervals, they provide enough insight into the type of entanglement dominant in the representation (redundancy or synergy).
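As a small illustration (ours, not the authors' implementation), the UNIBOUND score of Eq. 6 and the interval bounds of Eq. 7 reduce to elementwise arithmetic once the three families of mutual information terms are available; the array names below are placeholders for estimates obtained, e.g., via Eq. 8 in the next subsection.

```python
# Sketch: UNIBOUND (Eq. 6) and the PID interval bounds (Eq. 7) from mutual information
# estimates. Shapes: I_kl[k, l] = I(y_k; z_l), I_k_rest[k, l] = I(y_k; z_{\l}),
# I_k_all[k] = I(y_k; z), H_y[k] = H(y_k). (Placeholder arrays below.)
import numpy as np

def pid_bounds(I_kl, I_k_rest, I_k_all, H_y):
    relu = lambda a: np.maximum(a, 0.0)
    ii = I_kl + I_k_rest - I_k_all[:, None]          # interaction information II(y_k; z_l; z_{\l})
    uni_lower = relu(I_kl - I_k_rest)                # lower bound of U(y_k; z_l \ z_{\l})
    uni_upper = I_kl - relu(ii)                      # upper bound of U
    red_lower, red_upper = relu(ii), np.minimum(I_kl, I_k_rest)
    syn_lower, syn_upper = relu(-ii), np.minimum(I_kl, I_k_rest) - ii
    unibound = np.mean(uni_lower.max(axis=1) / H_y)  # Eq. 6: average over factors
    sel = uni_lower.argmax(axis=1)                   # report bounds at the same l as the lower bound
    pick = lambda b: b[np.arange(len(H_y)), sel]
    return unibound, {
        "U": (pick(uni_lower), pick(uni_upper)),
        "R": (pick(red_lower), pick(red_upper)),
        "C": (pick(syn_lower), pick(syn_upper)),
    }

# Tiny example with two factors and three latent variables (made-up numbers, in nats).
I_kl     = np.array([[1.2, 0.1, 0.1], [0.2, 0.9, 0.8]])
I_k_rest = np.array([[0.2, 1.2, 1.2], [0.9, 0.8, 0.9]])
I_k_all  = np.array([1.3, 1.0])
H_y      = np.array([np.log(3), np.log(3)])
print(pid_bounds(I_kl, I_k_rest, I_k_all, H_y))
```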
3.5 ESTIMATING BOUNDS BY EXACT LOG-MARGINAL DENSITIES

When the dataset is fully annotated with discrete generative factors and the inference distribution p(z|x) and its marginals p(zℓ|x), p(z\ℓ|x) are all tractable (e.g., mean-field variational models), we can compute the bounds in a similar way as is done by Chen et al. (2018) for MIG. Let zS be any of zℓ, z\ℓ, or z. We denote by D(yk) the subset of the dataset D with a specific value of yk. The mutual information I(yk; zS) = H(zS) − H(zS|yk) can then be computed by the following formula.

I(yk; zS) = E_{p(yk, zS)}[ log( (1/|D(yk)|) Σ_{x∈D(yk)} p(zS|x) ) − log( (1/|D|) Σ_{x∈D} p(zS|x) ) ].   (8)

Assuming that each generative factor yk is discrete and uniform, we employ stratified sampling over p(yk, zS). We approximate the expectation over p(zS|yk) by sampling x from the subset D(yk) and then sampling zS from p(z|x) to avoid quadratic computational cost. Following Chen et al. (2018), we used a sample size of 10,000 in the experiments. We use the log-sum-exp function to compute log Σ_x p(zS|x) = log Σ_x exp(log p(zS|x)) for numerical stability. The PID bounds and the UNIBOUND metric are computed by combining these estimates.

When the inference distribution p(z|x) is factorized, its log marginal log p(zS|x) is computed by simply adding up the log marginal of each variable as Σ_{ℓ∈S} log p(zℓ|x). Otherwise, we need to explicitly derive the marginal distribution and compute the log density. For example, when p(z|x) is a Gaussian distribution with mean µ and non-diagonal covariance Σ, p(zS|x) is a Gaussian distribution with mean (µi)_{i∈S} and covariance (Σij)_{i,j∈S}. Such a case arises for the attacked models we describe in the next section. Let M be the sample size for the expectations and N = |D| be the size of the dataset. Then, the computation of Eq. 8 requires O(MN) evaluations of the conditional density p(zS|x).

4 ENTANGLEMENT ATTACKS

To confirm that the proposed framework effectively captures interactions among multiple latent variables, we apply it to adversarial representations that entangle the generative factors. Instead of constructing a single artificial model, we modify a given model that disentangles factors well into a noised version that entangles the original variables. We call this process an entanglement attack. Let z ∈ R^L be the representation defined by a given model. Our goal is to design a transform from z to an attacked representation z̃ so that disentanglement metrics fail to capture the entanglement unless they correctly handle multi-variable interactions.

As we mentioned in Section 3, metrics that do not take into account the interaction among multiple latent variables may underestimate redundant information. To crystallize such a situation, we first design an entanglement attack that injects redundant information into multiple variables. For completeness, we also design a similar attack that injects synergistic information.

4.1 REDUNDANCY ATTACK

Let U be an L × L orthonormal matrix and ϵ ∈ R^L be a random vector following the standard normal distribution N(0, I). Using a hyperparameter α ≥ 0, we define the redundancy attack by

z̃red = (z, αUz + ϵ).

The coefficient α adjusts the degree of entanglement. When α = 0, the new representation just appends noise elements to the given representation, which does not affect the disentanglement. Increasing α makes the additional dimensions less noisy, resulting in z̃red redundantly encoding the information of factors conveyed by the original representation.
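As a minimal sketch (our illustration, not released code from the paper), the attack can be applied directly to latent samples; the orthonormal mixing matrix U, whose choice is discussed next, and the coefficient α are its only hyperparameters.

```python
# Sketch: redundancy attack on sampled latent codes z of shape (batch, L).
# z_red = (z, alpha * U z + eps) appends L extra dimensions that, for alpha > 0,
# redundantly re-encode the information carried by the original code.
import numpy as np

def redundancy_attack(z, alpha, mix=True, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    L = z.shape[1]
    U = np.eye(L) - (2.0 / L) * np.ones((L, L)) if mix else np.eye(L)  # orthonormal
    eps = rng.standard_normal(z.shape)
    return np.concatenate([z, alpha * z @ U.T + eps], axis=1)

z = np.random.default_rng(1).standard_normal((4, 6))   # stand-in for samples of p(z|x)
print(redundancy_attack(z, alpha=2.0).shape)           # -> (4, 12)
```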
The mixing matrix U determines how the information of individual variables in z is distributed over the additional dimensions. If we choose U = I, the dimensions are not mixed; thus considering one-to-one interactions between pairs of variables is enough to capture the redundancy. To mix the dimensions, we can instead use U = I − (2/L) 1 1⊤, which is an orthonormal matrix that mixes each variable with the others.

To evaluate the mutual information terms for the MIG and UNIBOUND metrics after the attack, we need an explicit formula for the inference distribution p(z̃red|x). When the original model has a Gaussian inference distribution p(z|x) = N(z; µ(x), Σ(x)), the attacked representation is the sum of a linear transformation of z and standard normal noise, which results in a Gaussian distribution p(z̃red|x) = N(z̃red; µ̃(x), Σ̃(x)) with mean µ̃(x) = (µ(x), αUµ(x)) and block covariance

Σ̃(x) = [Σ(x), αΣ(x)U⊤; αUΣ(x), I + α²UΣ(x)U⊤].

4.2 SYNERGY ATTACK

For completeness, we also define an attack that entangles the representation by increasing synergistic information. Using the same setting, with an L × L orthonormal matrix U and a noise vector ϵ ∼ N(0, I), we construct the synergy attack by

z̃syn = (z + αUϵ, ϵ).

Here again, the coefficient α adjusts the degree of entanglement. When α = 0, the attacked version just extends the original representation with independent noise elements. By increasing α, the noise is mixed with the parts conveying the information of factors. The attacked vector z̃syn fully conveys the information of factors in z regardless of α, as we can recover the original representation by z = z̃syn,1:L − αU z̃syn,L+1:2L. Note that most existing metrics correctly react to this attack, since the information held by individual variables is destroyed by the noise. The upper bound of the unique information is one of the quantities that evaluate synergistic information positively and is expected to ignore the attack.

5 EVALUATION

We evaluated the metrics in a toy model in Section 5.1 and in VAEs trained on datasets with the true generative factors in Section 5.2. We also performed detailed analyses of VAEs by plotting the PID terms of each factor, which is deferred to Appendix E due to the limited space.

5.1 EXACT ANALYSIS WITH TOY MODEL

We consider a toy model with the attacks defined in Section 4 as a sanity check. Suppose that data are generated by factors y ∼ N(0, I), and we have a latent representation that disentangles them up to noise as z|y ∼ N(y, σ²I), where σ > 0. With this simple setting, we can analytically compute MIG and UNIBOUND. For example, when we set σ = 0.1 and K = 5, we can derive the scores after the redundancy attack as

MIG = (1/c) log( 101 (1 + 0.65α²) / (1 + 1.01α²) )  and  UNIBOUND = (1/c) log( 101 (1 + 0.01α²) / (1 + 1.01α²) ),

where c = log(2πe). As a function of α, UNIBOUND decreases faster than MIG. The difference comes from the information distributed among the added dimensions of the attacked vector; while UNIBOUND counts all the entangled information, MIG only deals with one of the added dimensions. Indeed, the amount of untracked entanglement remains in MIG after taking the limit of α, as MIG → (1/c) log 65 while UNIBOUND → 0. We can also compute the scores after the synergy attack as MIG = UNIBOUND = (1/c) log(1 + 1/(α² + 0.01)), where both metrics correctly capture the injected synergy. Refer to Appendix C for the derivations of the exact scores for arbitrary parameter choices.
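These closed forms can be checked numerically; the sketch below (our addition, not code from the paper) builds the joint Gaussian covariance of (y, z̃red) for σ = 0.1, K = 5, and U = I − (2/K) 1 1⊤, computes every mutual information term from log-determinants, and compares the resulting MIG and UNIBOUND values with the formulas quoted above.

```python
# Sketch: numeric check of the Section 5.1 toy model under the redundancy attack.
# All variables are jointly Gaussian, so I(a; b) = 0.5*(log|C_a| + log|C_b| - log|C_ab|).
import numpy as np

K, sigma, alpha = 5, 0.1, 1.0
U = np.eye(K) - (2.0 / K) * np.ones((K, K))
S_z = (1 + sigma**2) * np.eye(K)                       # Cov(z), z = y + sigma * noise

cov = np.zeros((3 * K, 3 * K))                         # joint covariance of (y, z, alpha*U z + eps)
cov[:K, :K] = np.eye(K)
cov[:K, K:2*K] = cov[K:2*K, :K] = np.eye(K)            # Cov(y, z)
cov[:K, 2*K:] = alpha * U.T                            # Cov(y, alpha*U z + eps)
cov[2*K:, :K] = cov[:K, 2*K:].T
cov[K:2*K, K:2*K] = S_z
cov[K:2*K, 2*K:] = alpha * S_z @ U.T                   # Cov(z, alpha*U z + eps)
cov[2*K:, K:2*K] = cov[K:2*K, 2*K:].T
cov[2*K:, 2*K:] = alpha**2 * U @ S_z @ U.T + np.eye(K)

def mi(a, b):
    ld = lambda idx: np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (ld(a) + ld(b) - ld(a + b))

H_y = 0.5 * np.log(2 * np.pi * np.e)                   # entropy of y_k ~ N(0, 1)
dims = list(range(K, 3 * K))                           # the 2K dimensions of the attacked code
mig, unibound = [], []
for k in range(K):
    per_dim = sorted(mi([k], [d]) for d in dims)
    mig.append((per_dim[-1] - per_dim[-2]) / H_y)
    gaps = [mi([k], [d]) - mi([k], [e for e in dims if e != d]) for d in dims]
    unibound.append(max(0.0, max(gaps)) / H_y)

c = np.log(2 * np.pi * np.e)
print(np.mean(mig),      np.log(101 * (1 + 0.65 * alpha**2) / (1 + 1.01 * alpha**2)) / c)
print(np.mean(unibound), np.log(101 * (1 + 0.01 * alpha**2) / (1 + 1.01 * alpha**2)) / c)
```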
5.2 EMPIRICAL ANALYSIS WITH ANNOTATED DATASETS

We used the DSPRITES and 3DSHAPES datasets for our analysis. The DSPRITES dataset consists of 737,280 binary images generated from five generative factors (shape, size, rotation, and x/y coordinates). The 3DSHAPES dataset consists of 480,000 color images generated from six generative factors (floor/wall/object hues, scale, shape, and orientation). All images have 64 × 64 pixels. We used all the factors for y, encoded as a set of discrete (categorical) random variables.

We used variants of VAEs as methods for disentangled representation learning: β-VAE (Higgins et al., 2017), FactorVAE (Kim & Mnih, 2018), β-TCVAE (Chen et al., 2018), and JointVAE (Dupont, 2018). We trained all the models with six latent variables, one of which in JointVAE is a three-way categorical variable for DSPRITES and a four-way categorical variable for 3DSHAPES. We trained each model eight times with different random seeds and chose the best half of them for each metric to avoid cluttered results due to training instability. We optimized network weights with Adam (Kingma & Ba, 2015). We used the standard convolutional networks from the literature for the encoder and the decoder. See Appendix D for the details of architectures and hyperparameters. For disentanglement metrics, we compare the BetaVAE metric (Higgins et al., 2017), the FactorVAE metric (Kim & Mnih, 2018), MIG (Chen et al., 2018), and the proposed UNIBOUND metric.

Figure 3: Experimental results for VAEs trained with DSPRITES (top row) and 3DSHAPES (bottom row). (a) Disentanglement scores. (b) Estimated PID terms. Three orange bars in each plot represent the possible values of unique (U), redundant (R), and complementary (C) information, respectively. The top and bottom of each orange area correspond to the upper and lower bounds of the term, computed with Eq. 7. (c) Disentanglement scores of β-VAE and β-TCVAE after the redundancy attack with varying strength. See Appendix H for a larger version of the plots.

We first compared the models with each metric, as shown in Figure 3a. The trend is broadly similar between UNIBOUND and the other metrics, while they disagree in some cases. For example, FactorVAE achieves a higher MIG score for DSPRITES than β-VAE, while its UNIBOUND score is low. As we saw in Figure 2b, such a case occurs when a part of the redundancy, R(yk; zℓ, z\ℓ) − R(yk; zℓ, zℓ′), is large. This observation indicates that FactorVAE effectively forces each variable to encode information of distinct factors (i.e., one-vs-one redundancy is small), while it fails to avoid entangling the information over multiple variables (i.e., one-vs-all redundancy is large).

We can confirm that the redundancy is indeed large in FactorVAE by computing the PID bounds. In Figure 3b, we plot the aggregated bounds of U(yk; zℓ \ z\ℓ), R(yk; zℓ, z\ℓ), and C(yk; zℓ, z\ℓ). The plot reveals that FactorVAE tends to encode the factors redundantly into multiple latent variables. To understand the tasks and models more deeply, we evaluated the PID bounds for each factor in Figure 4. We can see that the factors are captured by the models in various ways. For example, β-TCVAE succeeds in disentangling the position and scale factors in DSPRITES, while it encodes orientation synergistically. This may reflect the inherent difficulty of disentangling this factor in DSPRITES, as the image of each shape corresponding to zero orientation is chosen arbitrarily.
These observations help us choose what kinds of inductive biases to introduce into the models.

Figure 4: Estimated PID terms of each factor in DSPRITES and 3DSHAPES for (a) β-VAE, (b) FactorVAE, (c) β-TCVAE, and (d) JointVAE. As in Figure 3b, three orange bars in each plot show the range between the lower and upper bound estimates of unique information (U), redundant information (R), and complementary information (C). See Figure 6 in the Appendix for a larger version and Table 5 for qualitative interpretations.

FactorVAE, on the other hand, tends to encode factors redundantly. This indicates that FactorVAE succeeds in making individual variables encode each factor, while it fails to prevent other variables from encoding the same information. JointVAE also encodes factors redundantly. While it fails to disentangle all the factors, it is the only model that encodes the shape and orientation into single latent variables. This can be viewed as the effect of introducing a discrete variable into the representation.

We can understand large redundancy as the models failing to make the variables sufficiently independent. The lower bound of redundancy in Eq. 7 is related to independence as

[II(yk; zℓ; z\ℓ)]_+ = [I(zℓ; z\ℓ) − I(zℓ; z\ℓ|yk)]_+ ≤ I(zℓ; z\ℓ).   (11)

Therefore, a high redundancy lower bound indicates large mutual information I(zℓ; z\ℓ); i.e., the latent variables are highly dependent. We conjecture that FactorVAE, which approximates the total correlation D_KL(p(z) ‖ Πℓ p(zℓ)) with a critic, fails to make the critic capture the dependency in z well enough in our experiments. Since the MIG score is relatively high, the critic succeeds in capturing pairwise dependency, while it fails to capture higher-dimensional dependency. The high redundancy in JointVAE can also be explained by the lack of independence; see Appendix F for details.

As seen in Figure 3a, some metrics have large deviations from the median. This is caused by randomness in training rather than in evaluation; see Appendix I for detailed analyses.

As we did for the toy model, we performed entanglement attacks in Figure 3c to assess their effect on each metric for learned representations. We selected the best training trials of β-VAE and β-TCVAE. Here, we only plot the results with the redundancy attack, as each metric already behaves well against the synergy attack. The plot reveals that the BetaVAE and FactorVAE metrics do not detect the redundancy injected by the attack. MIG slightly decreases with the attack, but the decrease is not significant compared with the score variation between learning methods observed in Figure 3a. UNIBOUND reacts strongly to the attack, indicating that it effectively detects the injected redundancy.

6 CONCLUSION

We established a framework of disentanglement analysis using Partial Information Decomposition. We formulated a new disentanglement metric, UNIBOUND, using the unique information bounds, and confirmed with entanglement attacks that the metric correctly responds to entanglement caused by multi-variable interactions that are not captured by other metrics. UNIBOUND sometimes disagrees with other metrics on VAEs trained with controlled datasets, which indicates that multi-variable interactions arise not only in artificial settings but also in learned representations. We found that VAEs trained with different methods induce representations with a variety of ratios between PID terms, even if their disentanglement scores are close.
Developing learning methods for disentangled representations on the basis of these observations is a major direction for future work.

ACKNOWLEDGMENTS

We thank the members of the Issei Sato Laboratory and researchers at Preferred Networks for fruitful discussions, and we thank the reviewers for helpful comments that improved the work.

REFERENCES

H.B. Barlow. Unsupervised Learning. Neural Computation, 1(3):295–311, 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.3.295.

H.B. Barlow, T.P. Kaushal, and G.J. Mitchison. Finding Minimum Entropy Codes. Neural Computation, 1(3):412–423, 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.3.412.

Adam B. Barrett. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E, 91:052802, 2015. doi: 10.1103/PhysRevE.91.052802.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. doi: 10.1109/TPAMI.2013.50.

Nils Bertschinger, Johannes Rauh, Eckehard Olbrich, Jürgen Jost, and Nihat Ay. Quantifying unique information. Entropy, 16(4):2161–2183, 2014. ISSN 1099-4300. doi: 10.3390/e16042161.

Ricky T. Q. Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, volume 31, 2018.

James J. DiCarlo and David D. Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333–341, 2007. ISSN 1364-6613. doi: 10.1016/j.tics.2007.06.010.

Kien Do and Truyen Tran. Theory and evaluation metrics for learning disentangled representations. In International Conference on Learning Representations, 2020.

Emilien Dupont. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, volume 31, 2018.

Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Patrick Esser, Johannes Haux, and Björn Ommer. Unsupervised robust disentangling of latent characteristics for image synthesis. In Proceedings of the Intl. Conf. on Computer Vision (ICCV), 2019.

Conor Finn and Joseph T. Lizier. Pointwise partial information decomposition using the specificity and ambiguity lattices. Entropy, 20(4), 2018. ISSN 1099-4300. doi: 10.3390/e20040297.

Conor Finn and Joseph T. Lizier. Generalised measures of multivariate information content. Entropy, 22(2), 2020. ISSN 1099-4300. doi: 10.3390/e22020216.

Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 1157–1166, 2019.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Irina Higgins, David Amos, David Pfau, Sébastien Racanière, Loïc Matthey, Danilo J. Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. CoRR, abs/1812.02230, 2018.
Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang, Yonghui Wu, and James Glass. Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5901–5905, 2019. doi: 10.1109/ICASSP.2019.8683561.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1587–1596, 2017.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 2649–2658, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, 2015.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR, 2014.

Zhiyuan Li, Jaideep Vitthal Murkute, Prashnna Kumar Gyawali, and Linwei Wang. Progressive learning and disentanglement of hierarchical representations. In International Conference on Learning Representations, 2020.

Joseph T. Lizier, Nils Bertschinger, Jürgen Jost, and Michael Wibral. Information decomposition of target effects from multi-source interactions: Perspectives on previous, current and future work. Entropy, 20(4), 2018. ISSN 1099-4300. doi: 10.3390/e20040307.

David J.C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Ari Pakman, Dar Gilboa, and Elad Schneidman. Estimating the unique information of continuous variables in recurrent networks. CoRR, abs/2102.00218, 2021.

Kyle Schick-Poland, Abdullah Makkeh, Aaron J. Gutknecht, Patricia Wollstadt, Anja Sturm, and Michael Wibral. A partial information decomposition for discrete and continuous variables. CoRR, abs/2106.12393, 2021.

Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992. doi: 10.1162/neco.1992.4.6.863.

David Sigtermans. A path-based partial information decomposition. Entropy, 22(9), 2020. ISSN 1099-4300. doi: 10.3390/e22090952.

Tycho M.S. Tax, Pedro A.M. Mediano, and Murray Shanahan. The partial information decomposition of generative neural network models. Entropy, 19(9), 2017. ISSN 1099-4300. doi: 10.3390/e19090474.

Satosi Watanabe. Pattern recognition as a quest for minimum entropy. Pattern Recognition, 13(5):381–387, 1981. ISSN 0031-3203. doi: 10.1016/0031-3203(81)90094-7.

Paul L. Williams and Randall D. Beer. Nonnegative decomposition of multivariate information. CoRR, abs/1004.2515, 2010.

Shujian Yu, Kristoffer Wickstrøm, Robert Jenssen, and José C. Príncipe. Understanding convolutional neural networks with information theory: An initial exploration. IEEE Transactions on Neural Networks and Learning Systems, 32(1):435–442, 2021. doi: 10.1109/TNNLS.2020.2968509.

Julian Zaidi, Jonathan Boilard, Ghyslain Gagnon, and Marc-André Carbonneau. Measuring disentanglement: A review of metrics. CoRR, abs/2012.09276, 2020.
A ADDITIONAL DIAGRAMS

Figure 5: Alternative diagram of Figure 1 in the band style following Figure 8.1 of MacKay (2003).

B RELATIONSHIPS TO OTHER INFORMATION-THEORETIC METRICS

Several metrics have been proposed in the literature for the information-theoretic measurement of disentanglement. For example, Do & Tran (2020) proposed four metrics, namely WSEPIN, WINDIN, RMIG, and JEMMIG. Among them, WSEPIN is computed based on the information gap I(x; zℓ|z\ℓ) = I(x; z) − I(x; z\ℓ). In the PID perspective, this quantity upper bounds the unique information of x held by zℓ, i.e., I(x; zℓ|z\ℓ) ≥ U(x; zℓ \ z\ℓ). It is similar to the upper bound of unique information that we derived in Eq. 7, as

I(yk; zℓ) − [II(yk; zℓ; z\ℓ)]_+ ≤ I(yk; zℓ) − II(yk; zℓ; z\ℓ) = I(yk; z) − I(yk; z\ℓ).   (12)

In that sense, WSEPIN is an upper bound of unique information where yk is replaced with x and the nonnegativity of redundant information is ignored. As WSEPIN does not handle generative factors separately, it is useful when the generative factors are unknown, while our approach provides more detailed information about (dis)entanglement when the generative factors are available.

RMIG is an extension of MIG where the conditioning between yk and x is inverted so that one can compute the quantity with an uncontrolled dataset. This approach may be extended to our framework as well, in a similar way to applying the inversion to MIG. We did not use this extension, as we only deal with controlled datasets in this paper to contrast our approach with a wider variety of metrics.

JEMMIG is also an extension of MIG, which provides a measurement of how well each variable captures only one generative factor. This aspect is also studied by Eastwood & Williams (2018) and Li et al. (2020). As we used a set of simple latent variables, each of which is either a one-dimensional real variable or a categorical variable with a small number of arms (three or four depending on the dataset), we expect that each variable does not capture much information about multiple generative factors. Indeed, we did not observe any latent variable selected for more than two generative factors, and when a latent variable was selected for two generative factors, the overall disentanglement score was low, so the effect of duplicated selection is negligible. We still consider it an interesting direction for future work to extend our approach to analyzing the entanglement of multiple generative factors into a single latent variable.

C DERIVATIONS FOR THE EXACT ANALYSIS WITH THE TOY MODEL

In this section, we provide a sketch of analytically computing the metrics for the toy model used in Section 5.1. We use the following formulae. Let u be a Gaussian vector with covariance Σ. Then, its entropy is given by H(u) = (1/2) log(2πe|Σ|), where |·| is the matrix determinant. When we remove the ℓ-th element from u, the remaining vector u\ℓ follows a Gaussian distribution whose covariance matrix Σ\ℓ is obtained by removing the ℓ-th row and column from Σ. The determinant of Σ\ℓ is the cofactor of Σ at the ℓ-th diagonal element, i.e., |Σ\ℓ| = |Σ|(Σ⁻¹)ℓℓ. With this formula, we can derive the entropy of u\ℓ as H(u\ℓ) = H(u) + (1/2) log(Σ⁻¹)ℓℓ.

Suppose that an invertible matrix Σ is written in blocks as Σ = [A, B; C, D]. If A and its Schur complement S := D − CA⁻¹B are invertible, the determinant and inverse of Σ are
|Σ| = |A| |S|,   Σ⁻¹ = [A⁻¹ + A⁻¹BS⁻¹CA⁻¹, −A⁻¹BS⁻¹; −S⁻¹CA⁻¹, S⁻¹].   (13)

Similarly, if D and its Schur complement T := A − BD⁻¹C are invertible, then

|Σ| = |D| |T|,   Σ⁻¹ = [T⁻¹, −T⁻¹BD⁻¹; −D⁻¹CT⁻¹, D⁻¹ + D⁻¹CT⁻¹BD⁻¹].   (14)

Provided that z is a Gaussian vector with covariance Σ, we can derive the entropies of the attacked vectors. Recall that the redundancy and synergy attacks are defined as

z̃red = [I, 0; αU, I] (z; ϵ),   z̃syn = [I, αU; 0, I] (z; ϵ),

where U is an orthonormal matrix, α > 0, ϵ ∼ N(0, I), and 0 is the zero matrix. These again follow Gaussian distributions, whose covariance matrices are computed as follows.

Cov(z̃red) = [Σ, αΣU⊤; αUΣ, I + α²UΣU⊤],   Cov(z̃syn) = [α²I + Σ, αU; αU⊤, I].

Using Eq. 13 and Eq. 14, we obtain their determinants and inverses.

|Cov(z̃red)| = |Σ|,   Cov(z̃red)⁻¹ = [α²I + Σ⁻¹, −αU⊤; −αU, I],
|Cov(z̃syn)| = |Σ|,   Cov(z̃syn)⁻¹ = [Σ⁻¹, −αΣ⁻¹U; −αU⊤Σ⁻¹, I + α²U⊤Σ⁻¹U].

Therefore, we obtain the following entropies for each ℓ ∈ {1, . . . , L}.

H(z̃red,ℓ) = (1/2) log(2πe Σℓℓ),
H(z̃red,\ℓ) = (1/2) log(2πe |Σ| (α² + (Σ⁻¹)ℓℓ)),
H(z̃red,L+ℓ) = (1/2) log(2πe (1 + α² uℓ⊤Σuℓ)),
H(z̃red,\(L+ℓ)) = (1/2) log(2πe |Σ|),
H(z̃syn,ℓ) = (1/2) log(2πe (α² + Σℓℓ)),
H(z̃syn,\ℓ) = (1/2) log(2πe |Σ| (Σ⁻¹)ℓℓ),
H(z̃syn,L+ℓ) = (1/2) log(2πe),
H(z̃syn,\(L+ℓ)) = (1/2) log(2πe |Σ| (1 + α² vℓ⊤Σ⁻¹vℓ)),
H(z̃red) = H(z̃syn) = (1/2) log(2πe |Σ|).   (15)

Here, uℓ and vℓ are the ℓ-th columns of U⊤ and U, respectively. When we use U = I − (2/K) 1 1⊤, these are written as uℓ = vℓ = eℓ − (2/K) 1, where (e1, . . . , eL) is the standard basis of R^L.

Table 1: Exact values of metrics for the toy model. We use U = I − (2/K) 1 1⊤ for the attacks. We do not normalize the metrics by H(yk), as y is isotropic and continuous; normalization by it just scales the results by a constant scalar. For the attacked models, we also show the limit as α → ∞ to understand how well each metric reacts to completely entangled representations.

Metric: MIG
  z:      (1/2) log(1 + 1/σ²)
  z̃red:   (1/2) log[ (1+σ²)(1+α²(1+σ²−(1−2/K)²)) / (σ²(1+α²(1+σ²))) ];  limit: (1/2) log(1 + (1−(1−2/K)²)/σ²)
  z̃syn:   (1/2) log(1 + 1/(α²+σ²));  limit: 0

Metric: U lower bound (UNIBOUND)
  z:      (1/2) log(1 + 1/σ²)
  z̃red:   (1/2) log[ (1+σ²)(1+α²σ²) / (σ²(1+α²(1+σ²))) ];  limit: 0
  z̃syn:   (1/2) log(1 + 1/(α²+σ²));  limit: 0

Metric: U upper bound
  z:      (1/2) log(1 + 1/σ²)
  z̃red:   (1/2) log[ (1+σ²)(1+α²σ²) / (σ²(1+α²(1+σ²))) ];  limit: 0
  z̃syn:   (1/2) log(1 + 1/(α²+σ²));  limit: 0

Table 2: Encoder and decoder architectures used in the DSPRITES and 3DSHAPES experiments. Data flow from top to bottom. Conv4x4s2p1 represents a spatial convolution layer with kernel size 4x4, stride 2x2, and 1-pixel padding on each side of the image. ConvT represents the transposed convolution layer. FC stands for a fully-connected layer. The last fully-connected layer of the encoder outputs the parameters of the variational posterior distribution, the number of which depends on the model definition as follows. We used six latent variables in all models, which are all Gaussian except for JointVAE, where one of them is replaced with a categorical variable. As Gaussian variables are parameterized by mean and standard deviation while categorical variables are parameterized by logits, the final feature dimensionality is 12 for all models except for JointVAE, where the dimensionality is 13 for DSPRITES and 14 for 3DSHAPES.

Encoder                        Decoder
Conv4x4s2p1, 32 channels       FC, 256 features
Conv4x4s2p1, 32 channels       FC, 64x4x4 features
Conv4x4s2p1, 64 channels       ConvT4x4s2p1, 64 channels
Conv4x4s2p1, 64 channels       ConvT4x4s2p1, 32 channels
FC, 256 features               ConvT4x4s2p1, 32 channels
FC, * features                 ConvT4x4s2p1, 1 or 3 channels

Recall that, in the toy model, the representation z is drawn from N(y, σ²I), where the generative factors y follow the standard Gaussian distribution.
Therefore, the marginal distribution of the representation is p(z) = N(z; 0, (1 + σ²)I). When the representation is conditioned on a single factor yk, it follows p(z|yk) = N(z; yk ek, (1 + σ²)I − ek ek⊤). By substituting the covariance matrices of these distributions for Σ in Eq. 15 and computing the differences, we obtain the mutual information terms as follows.

I(yk; z̃red,ℓ) = (1/2) log((1+σ²)/σ²) if k = ℓ, and 0 if k ≠ ℓ,
I(yk; z̃red,L+ℓ) = (1/2) log[ (1+α²(1+σ²)) / (1+α²(1+σ²−(1−2/K)²)) ] if k = ℓ, and (1/2) log[ (1+α²(1+σ²)) / (1+α²(1+σ²−4/K²)) ] if k ≠ ℓ,
I(yk; z̃red,\ℓ) = (1/2) log[ (1+α²(1+σ²)) / (1+α²σ²) ] if k = ℓ, and (1/2) log((1+σ²)/σ²) if k ≠ ℓ,
I(yk; z̃red,\(L+ℓ)) = (1/2) log(1 + 1/σ²),
I(yk; z̃syn,ℓ) = (1/2) log[ (1+σ²+α²) / (σ²+α²) ] if k = ℓ, and 0 if k ≠ ℓ,
I(yk; z̃syn,L+ℓ) = 0,   (16)
I(yk; z̃syn,\ℓ) = 0 if k = ℓ, and (1/2) log((1+σ²)/σ²) if k ≠ ℓ,
I(yk; z̃syn,\(L+ℓ)) = (1/2) log[ (1+σ²)(1+σ²+α²) / (σ²(1+σ²+α²)+α²(1−2/K)²) ] if k = ℓ, and (1/2) log[ (1+σ²)(1+σ²+α²) / (σ²(1+σ²+α²)+4α²/K²) ] if k ≠ ℓ,
I(yk; z̃red) = I(yk; z̃syn) = (1/2) log(1 + 1/σ²).

The metrics in Table 1 are computed from these quantities. In addition, we can compute other partial information terms as follows.

R(yk; z̃red,k, z̃red,\k) = (1/2) log[ (1 + α²(1 + σ²)) / (1 + α²σ²) ],   C(yk; z̃red,k, z̃red,\k) = 0,
R(yk; z̃syn,k, z̃syn,\k) = 0,   C(yk; z̃syn,k, z̃syn,\k) = (1/2) log[ (1 + σ²)(σ² + α²) / (σ²(1 + σ² + α²)) ].

We can observe that the redundancy and synergy terms effectively increase with the corresponding attacks.

D MODEL ARCHITECTURES AND TRAINING HYPERPARAMETERS

The architectures of the encoder and decoder used in the experiments are listed in Table 2. We used ReLU nonlinearity at each convolutional layer except for the final output of each network. In addition, for the critic in FactorVAE we used a feedforward network with five hidden layers, each consisting of 1,000 leaky ReLU units with a slope coefficient of 0.2. The hyperparameters used in training are listed in Table 3 and Table 4. For hyperparameters not listed in the tables, we used the values suggested in the original papers.

Table 3: Training hyperparameters used for the DSPRITES experiments.

Model                             β-VAE       FactorVAE    JointVAE    β-TCVAE
Batch size                        64          64           64          2,048
Iterations                        300,000     300,000      300,000     30,000
Adam α                            5 × 10⁻⁴    1 × 10⁻⁴     5 × 10⁻⁴    1 × 10⁻³
Adam β1, β2                       0.9, 0.999  0.9, 0.999   0.9, 0.999  0.9, 0.999
Critic Adam α                     -           1 × 10⁻⁴     -           -
Critic Adam β1, β2                -           0.5, 0.9     -           -
Discrete variable capacity Cc     -           -            1.1         -
Continuous variable capacity Cz   -           -            40          -
Regularization coefficient        β = 4       γ = 35       γ = 150     β = 6

Table 4: Training hyperparameters used for the 3DSHAPES experiments.

Model                             β-VAE       FactorVAE    JointVAE    β-TCVAE
Batch size                        64          64           64          2,048
Iterations                        500,000     500,000      500,000     50,000
Adam α                            1 × 10⁻⁴    1 × 10⁻⁴     1 × 10⁻⁴    1 × 10⁻³
Adam β1, β2                       0.9, 0.999  0.9, 0.999   0.9, 0.999  0.9, 0.999
Critic Adam α                     -           1 × 10⁻⁵     -           -
Critic Adam β1, β2                -           0.5, 0.9     -           -
Discrete variable capacity Cc     -           -            1.1         -
Continuous variable capacity Cz   -           -            40          -
Regularization coefficient        β = 4       γ = 20       γ = 150     β = 4

E FULL RESULTS FOR FACTOR-WISE PID ANALYSES

We illustrate the estimated bounds of PID terms for each factor in Figure 6, a smaller version of which appeared in Figure 4. We summarize these results in Table 5. This table was made by observing the plots in Figure 6 and categorizing them into the following patterns. Note that we ignore the error bars here.

1. If the lower bound of the unique information is larger than the upper bounds of the redundant and complementary information, mark the plot as disentangled.
In this case, the model is determined to disentangle the factor successfully regardless of the concrete definition of PID. Note that the model does not necessarily learn the factor completely; see the figure for how much of the information of the factor is uniquely captured by a latent variable.

2. If the upper and lower bounds of the redundant information are larger than those of the complementary information by a margin, mark the plot as redundant. In this case, the model entangles the factor in a redundant way. As we analyzed in Section 5.2, this indicates that the latent variables are highly dependent.

3. If the upper and lower bounds of the complementary information are larger than those of the redundant information by a margin, mark the plot as synergistic. In this case, the model entangles the factor in a synergistic way. This may occur even if the latent variables are independent. We may require additional inductive biases to help the model disentangle the factor.

4. Otherwise, mark the plot as flat.

Figure 6: Large version of Figure 4. (a) DSPRITES. (b) 3DSHAPES.

Table 5: Qualitative summary of the PID decomposition for model-factor pairs. Each pair is explained by the following terms — disentangled: the unique information is larger than the other terms; synergistic, redundant: the corresponding PID term is large; flat: no term exceeds the others by much, which indicates that either all three terms are small (i.e., multiple variables contain distinct information about the factor) or both redundancy and synergy are large. Note that these qualitative analyses are only applicable to our experimental settings. In particular, the high redundancy of JointVAE may be caused by a capacity hyperparameter. See Appendix G for details.

Dataset    Factor        β-VAE          FactorVAE      β-TCVAE        JointVAE
DSPRITES   shape         flat           redundant      flat           redundant
DSPRITES   scale         synergistic    redundant      disentangled   redundant
DSPRITES   orientation   synergistic    flat           synergistic    redundant
DSPRITES   position x    synergistic    redundant      disentangled   redundant
DSPRITES   position y    synergistic    redundant      disentangled   redundant
3DSHAPES   floor hue     disentangled   disentangled   disentangled   redundant
3DSHAPES   wall hue      redundant      redundant      redundant      redundant
3DSHAPES   object hue    redundant      redundant      redundant      redundant
3DSHAPES   scale         flat           flat           synergistic    redundant
3DSHAPES   shape         flat           flat           synergistic    redundant
3DSHAPES   orientation   flat           redundant      disentangled   redundant

F POSSIBLE EXPLANATIONS FOR HIGH REDUNDANCY IN JOINTVAE

We observed in Table 5 and Figure 6 that JointVAE suffers from high redundancy in all the factors. To explain this phenomenon, we review the training scheme of JointVAE (Dupont, 2018).

Table 6: KL terms and total correlation of latent variables learned by JointVAE. The values in parentheses show the standard deviation.

Dataset     D_KL(p(z1|x) ‖ U(z1))    D_KL(p(z2:L|x) ‖ N(z2:L; 0, I))    Total correlation of z
DSPRITES    1.10 (±0.00)             40.00 (±0.04)                      25.02 (±0.47)
3DSHAPES    1.10 (±0.00)             39.99 (±0.05)                      25.98 (±0.56)

The JointVAE model consists of one categorical variable z1 and L − 1 Gaussian variables z2:L. Let U(z1) be the uniform categorical distribution and pd(x|z) be the decoder, learned simultaneously. Then, the objective function of JointVAE is

L = E_{p(z|x)}[log pd(x|z)] − γ |D_KL(p(z1|x) ‖ U(z1)) − c1| − γ |D_KL(p(z2:L|x) ‖ N(z2:L; 0, I)) − c2|,

which involves three hyperparameters: the regularization coefficient γ, the capacity of the discrete variable c1, and the capacity of the continuous variables c2.
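For reference, here is a minimal, schematic sketch of this capacity-controlled objective (our paraphrase; the interface of `model` returning the reconstruction log-likelihood and the two KL terms is an assumption, and the linear capacity ramp is one common way to realize the "gradually increased" schedule described next, not a detail taken from the paper's configuration).

```python
# Sketch of the JointVAE objective with annealed capacities c1, c2 (schematic Python;
# `model(x)` is assumed to return E[log p_d(x|z)] and the discrete/continuous KL terms).
def jointvae_loss(x, model, step, total_steps, gamma, C_c, C_z):
    recon, kl_disc, kl_cont = model(x)
    ramp = min(step / (0.5 * total_steps), 1.0)      # capacities saturate mid-training
    c1, c2 = ramp * C_c, ramp * C_z
    # L = E[log p_d(x|z)] - gamma*|KL_disc - c1| - gamma*|KL_cont - c2|; we minimize -L.
    return -(recon - gamma * abs(kl_disc - c1) - gamma * abs(kl_cont - c2))
```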
Throughout training, γ is kept constant, while the capacities c1 and c2 are gradually increased and saturated at predefined maximum values (Cc and Cz, respectively) in the middle of training. As the capacities are positive at the end of training, the KL terms, which include the total correlation of the latent variables (Kim & Mnih, 2018; Chen et al., 2018), are large in the trained model. Therefore, as we discussed in Section 5.2, this model does not put pressure on the representation to be less redundant, which may explain the high redundancy (see footnote 3). See Appendix G for the results with varying Cz, which also support the above hypothesis, as low capacities induce lower redundancy. This training scheme is designed to align the amount of information captured by the discrete and continuous variables, which seems to be successful in our observations; i.e., it captures the shape factor in both datasets more effectively than the other methods.

Footnote 3: To confirm that a large capacity actually causes large dependency, we measured the KL terms and the total correlation of the trained JointVAE models. We summarize the results in Table 6. These values indicate that the KL terms are indeed close to the capacity hyperparameters (see Table 3 and Table 4), and more than half of them are occupied by the total correlation.

G INFLUENCE OF REGULARIZATION HYPERPARAMETERS ON DISENTANGLEMENT

We trained each model with varying hyperparameters (β for β-VAE and β-TCVAE, γ for FactorVAE, and the capacity of continuous variables Cz for JointVAE) and investigated how the metrics as well as the PID terms are affected by the regularization strength (see footnote 4).

Footnote 4: For JointVAE, we chose the final capacity of the continuous variables instead of the regularization coefficient, as we expect the capacity to be more relevant to disentanglement. The former controls the KL divergence between the aggregated posterior of the continuous variables and their prior at the end of training, while the latter controls the strength of enforcing the KL divergence to be close to the capacity. See the original paper (Dupont, 2018) for more details.

Figure 7: Metrics and PID estimations for varying regularization coefficients on DSPRITES. (a) β-VAE: the KL term coefficient β is varied. (b) FactorVAE: the multiplier of the total correlation γ is varied. (c) β-TCVAE: the multiplier of the total correlation β is varied. (d) JointVAE: the final capacity of continuous variables Cz is varied; note that a higher capacity induces a weaker regularization effect. For each model, the left panel shows how each metric reacts to changing the hyperparameter. The right panel shows the PID estimations of the resulting representations. In all plots, the horizontal axis corresponds to the regularization hyperparameter.

Figure 8: Metrics and PID estimations for varying regularization coefficients on 3DSHAPES. Panels (a)-(d) are arranged as in Figure 7.

The results for DSPRITES and 3DSHAPES are illustrated in Fig. 7 and Fig. 8, respectively. From the left panels, we can observe that the UNIBOUND metric is positively correlated with the regularization strength (see footnote 5). This indicates that the regularization method introduced by each model positively contributes to disentanglement from the PID perspective. We also show the estimated PID bounds in the right panels. They reveal some interesting effects of regularization on the types of entanglement.

Footnote 5: In JointVAE, the hyperparameter controls the final KL term; hence, a smaller hyperparameter should induce a more disentangled representation.
For example, in JointVAE, the representation has low redundancy when the capacity of the continuous variables is small, and the redundancy grows significantly when we increase the capacity. This indicates that a large capacity forces each latent variable to capture details of the input images, regardless of whether the same information is redundantly captured by other variables.

H LARGE FIGURES OF EXPERIMENTAL RESULTS

We put large versions of Figure 3 and Figure 4 in Figure 9 and Figure 6, respectively, for finer rendering.

I TRAINING AND EVALUATION STABILITY

We illustrate the disentanglement scores of models trained with eight training seeds in Figure 10 and Figure 11. As each disentanglement metric involves sampling during evaluation, the evaluated score has some randomness even if we fix the training random seed. These plots reveal that the deviation caused by randomness in evaluating the disentanglement metrics is much smaller than the deviation caused by randomness in training each model.

Figure 9: Large versions of Fig. 3: (a) disentanglement scores, (b) PID terms, (c) redundancy attacks. (Left) Results for DSPRITES. (Right) Results for 3DSHAPES.

Figure 10: Disentanglement scores before aggregating across different training seeds for DSPRITES, shown per model (β-VAE, FactorVAE, β-TCVAE, and JointVAE). We optimized the parameters of each model eight times with different random seeds, whose scores are illustrated by the eight boxes in each plot.

Figure 11: Disentanglement scores before aggregating across different training seeds for 3DSHAPES.