# EVALUATING DISENTANGLEMENT OF STRUCTURED REPRESENTATIONS

Raphaël Dang-Nhu

We introduce the first metric for evaluating disentanglement at individual hierarchy levels of a structured latent representation. Applied to object-centric generative models, this offers a systematic, unified approach to evaluating (i) object separation between latent slots, (ii) disentanglement of object properties inside individual slots, and (iii) disentanglement of intrinsic and extrinsic object properties. We theoretically show that for structured representations, our framework gives stronger guarantees of selecting a good model than previous disentanglement metrics. Experimentally, we demonstrate that viewing object compositionality as a disentanglement problem addresses several issues with prior visual metrics of object separation. As a core technical component, we present the first representation probing algorithm handling slot permutation invariance.

1 INTRODUCTION

A salient challenge in generative modeling is the ability to decompose the representation of images and scenes into distinct objects that are represented separately and then combined. Indeed, the capacity to reason about objects and their relations is a central aspect of human intelligence (Spelke et al., 1992), which can be used in conjunction with graph neural networks or symbolic solvers to enable relational reasoning. For instance, Wang et al. (2018) exploit robot body structure to obtain competitive performance and transferability when learning to walk or swim in the Gym environments. Yi et al. (2019) perform state-of-the-art visual reasoning by using an object-based representation in combination with a symbolic model. In recent years, several models have been proposed to learn compositional representations in an unsupervised fashion: TAGGER (Greff et al., 2016), NEM (Greff et al., 2017), R-NEM (Van Steenkiste et al., 2018), MONet (Burgess et al., 2019), IODINE (Greff et al., 2019), GENESIS (Engelcke et al., 2019), Slot Attention (Locatello et al., 2020), MulMON (Li et al., 2020), and SPACE (Lin et al., 2020). They jointly learn to represent individual objects and to segment the image into meaningful components, the latter often being called "perceptual grouping". These models share a number of common principles: (i) splitting the latent representation into several slots, each meant to contain the representation of an individual object; (ii) inside each slot, encoding information about both object position and appearance; (iii) maintaining a symmetry between slots in order to respect the permutation invariance of object composition. These mechanisms are illustrated in Figure 1.

To compare and select models, it is indispensable to have robust disentanglement metrics. At the level of individual factors of variation, a representation is said to be disentangled when information about the different factors is separated between different latent dimensions (Bengio et al., 2013; Locatello et al., 2019). At the object level, disentanglement measures the degree of object separation between slots. However, all existing metrics (Higgins et al., 2016; Chen et al., 2018; Ridgeway and Mozer, 2018; Eastwood and Williams, 2018; Kumar et al., 2017) are limited to the individual case, which disregards representation structure. To cite Kim and Mnih (2018) about the FactorVAE metric: "The definition of disentanglement we use [...] is clearly a simplistic one.
It does not allow correlations among the factors or hierarchies over them. Thus this definition seems more suited to synthetic data with independent factors of variation than to most realistic datasets."

Figure 1: A compositional latent representation is composed of several slots. Each slot generates a part of the image, and the different parts are then composed together. (The diagram shows k latent slots decoded by shared-weight decoders and combined by a composition module into the reconstructed image.) Pixel-level metrics (e.g., ARI) measure object separation between slots at the visual level, while our object-level disentanglement framework operates purely in latent space.

As a result, prior work has been restricted to measuring the degree of object separation via pixel-level segmentation metrics. Most common is the Adjusted Rand Index (ARI) (Rand, 1971; Greff et al., 2019), where image segmentation is viewed as a cluster assignment for pixels. Other metrics such as Segmentation Covering (mSC) (Arbelaez et al., 2010) have been introduced to penalize over-segmentation of objects. A fundamental limitation is that they do not directly evaluate the quality of the representation, but instead consider a visual proxy of object separation. This results in a problematic dependence on the quality of the inferred segmentation masks, a problem first identified by Greff et al. (2019) for IODINE, and confirmed in our experimental study.

To address these limitations, we introduce the first metric for evaluating disentanglement at individual hierarchy levels of a structured latent representation. Applied to object-centric generative models, this offers a systematic, unified approach to evaluating (i) object separation between latent slots, (ii) disentanglement of object properties inside individual slots, and (iii) disentanglement of intrinsic and extrinsic object properties. We theoretically show that our framework gives stronger guarantees of representation quality than previous disentanglement metrics; thus, it can safely be substituted for them. We experimentally demonstrate the applicability of our metric to three architectures: MONet, GENESIS and IODINE. The results confirm issues with pixel-level segmentation metrics, and offer valuable insight into the representations learned by these models. Finally, as a core technical component, we present the first representation probing algorithm handling slot permutation invariance.

2 BACKGROUND

2.1 DISENTANGLEMENT CRITERIA

There exists an extensive literature discussing notions of disentanglement; accounting for all the different definitions is outside the scope of this paper. We chose to focus on the three criteria formalized by Eastwood and Williams (2018), which stand out because of their clarity and simplicity. Disentanglement is the degree to which a representation separates the underlying factors of variation, with each latent variable capturing at most one generative factor. Completeness is the degree to which each underlying factor is captured by a single latent variable. Informativeness is the amount of information that a representation captures about the underlying factors of variation. As in prior work, the word disentanglement is also used as a generic term that simultaneously refers to these three criteria; context should make clear whether the general or specific meaning is intended.
2.2 THE DCI FRAMEWORK

For brevity, we only describe the DCI metrics of Eastwood and Williams (2018), which are most closely related to our work. The supplementary material provides a comprehensive overview of alternative metrics. Consider a dataset $X$ composed of $n$ observations $x^1, \dots, x^n$, which we assume are generated by combining $F$ underlying factors of variation. The values of the different factors for observation $x^l$ are denoted $v^l_1, \dots, v^l_F$. Suppose we have learned a representation $z = (z_1, \dots, z_L)$ from this dataset. The DCI metric is based on the affinity matrix $R = (R_{i,j})$, where $R_{i,j}$ measures the relative importance of latent $z_i$ in predicting the value of factor $v_j$. Supposing appropriate normalization for $R$, disentanglement is measured as a weighted average of the row entropies of the matrix; conversely, completeness is measured as a weighted average of the column entropies. Informativeness is measured as the normalized error of the predictor used to obtain the matrix $R$.
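To make this concrete, the following minimal sketch (our own illustration, not the authors' reference implementation) computes DCI disentanglement and completeness from an importance matrix $R$ with latents as rows and factors as columns; the importance-based weighting follows the probabilistic description above.

```python
import numpy as np

def dci_scores(R):
    """DCI disentanglement/completeness from a (n_latents, n_factors)
    matrix of non-negative feature importances (illustrative sketch)."""
    n_latents, n_factors = R.shape
    eps = 1e-12

    def entropy(p, base):
        p = p / (p.sum() + eps)
        return -np.sum(p * np.log(p + eps)) / np.log(base)

    # Disentanglement: 1 - entropy of each row (latent), base = n_factors,
    # weighted by the latent's share of the total importance.
    row_w = R.sum(axis=1) / R.sum()
    D = sum(w * (1 - entropy(R[i], n_factors)) for i, w in enumerate(row_w))

    # Completeness: 1 - entropy of each column (factor), base = n_latents,
    # weighted by the factor's share of the total importance.
    col_w = R.sum(axis=0) / R.sum()
    C = sum(w * (1 - entropy(R[:, j], n_latents)) for j, w in enumerate(col_w))
    return D, C
```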
3 EVALUATING STRUCTURED DISENTANGLEMENT

Section 3.1 contains a high-level description of the goals steering our framework. Our metric is formally described in Section 3.2. In Section 3.3, we present theoretical results showing that our framework can safely be substituted for DCI, since it provides stronger guarantees that the selected model is correctly disentangled.

3.1 PRESENTATION OF THE FRAMEWORK

Prior work has focused on measuring disentanglement at the level of individual factors of variation and latent dimensions (which we will refer to as global or unstructured disentanglement). In the case of compositional representations with several object slots, additional properties are desirable:

1. Object-level disentanglement: Objects and latent slots should have a one-to-one matching. That is, changing one object should lead to changes in a single slot, and vice-versa.
2. Slot disentanglement: Inside a given slot, the properties of the object must be disentangled.
3. Slot symmetry: The latent dimension responsible for a given factor (e.g., color) should be invariant across slots. This means that all slots should have the same inner structure.

Our structured disentanglement metric evaluates all these criteria within a single unified framework. Similarly to DCI, it is based on the affinity matrix $(R_{\hat\tau,\tau})$, where $R_{\hat\tau,\tau}$ measures the relative importance of latent $z_{\hat\tau}$ in predicting the value of factor $v_\tau$. The key novelty compared to DCI is that we propose to measure disentanglement with respect to arbitrary projections of the matrix $R$. Intuitively, projections correspond to a marginalization operation that selects a subset of hierarchy levels and discards the others. The coefficients $R_{\hat\tau,\tau}$ that are projected together are summed to create group affinities. Figure 2 gives an intuitive illustration of this process on a toy example. With object-centric representations, projecting at the object/slot level allows studying the relation of objects and slots without taking their internal structure into consideration; ultimately, this permits evaluating our object-level disentanglement criterion. Projecting at the property level allows studying the internal slot representation independently of the object, for evaluation of both internal slot disentanglement and slot symmetry with a single metric. The identity projection, which conserves all levels, measures flat (DCI) disentanglement. This generalizes to arbitrary hierarchies in the representation, such as disentanglement of position and appearance, or of intrinsic and extrinsic object properties.

3.2 MATHEMATICAL DEFINITION

To formalize our framework most conveniently, we propose to view the affinity matrix $R = (R_{\hat\tau,\tau})$ as a joint random variable $(X, Y)$, where $X$ is a random latent dimension and $Y$ a random factor. Supposing that $R$ is normalized to sum to one, this means $P[X = \hat\tau, Y = \tau] = R_{\hat\tau,\tau}$. This point of view, which is only implicit in prior work, allows formalizing the projections of $R$ as a coupled marginalization operation $(\rho(X), \rho(Y))$, where $\rho(\alpha_1, \dots, \alpha_h) = (\alpha_{e_1}, \dots, \alpha_{e_l})$ selects a subset of hierarchy levels. This yields a concise and elegant mathematical framework, where we can build on standard information-theoretic identities to derive theoretical properties. In the following, $H_U(A \mid B)$ denotes the conditional entropy of $A$ given $B$ in base $U$ (detailed definitions in the supplementary).

Figure 2: Overview of projections on a toy example of an object-centric representation, with two slots of two dimensions each and two objects with two factors (color, size) each. The affinity scores $R$ are marginalized according to the selected hierarchy levels: the projection on slot/object yields object-level disentanglement, the projection on dimension/property yields property-level completeness, and the identity projection keeps the full matrix and yields unstructured (DCI) completeness. In projected space, row entropy measures disentanglement, while column entropy measures completeness; computing the entropy involves renormalizing the projected scores (per factor group for completeness, per latent group for disentanglement). The affinity scores $R$ are computed in an all-pairs manner; a mapping can be obtained (e.g., from object to slot) by looking for the maximum of each column.

Our disentanglement criteria are the following:

Completeness with respect to projection $\rho$ measures how well the projected factors are captured by a coherent group of latents, via column entropy in the projected space. It is measured as
$$C(\rho) = 1 - H_U(\rho(X) \mid \rho(Y)),$$
where $U = |\rho(\hat{\mathcal{T}})|$ is the number of groups of latents in the projection.

Disentanglement with respect to projection $\rho$ measures to what extent a group of latent variables influences a coherent subset of factors, via projected row entropy. It is measured as
$$D(\rho) = 1 - H_V(\rho(Y) \mid \rho(X)),$$
where $V = |\rho(\mathcal{T})|$ is the number of groups of factors in the projection.

Informativeness does not depend on the projection. It is defined as the normalized error $I = \lVert f(z) - v \rVert^2$ of a low-capacity model $f$ that learns to predict the factor values $v$ from the latents $z$.

The changing log bases $U$ and $V$ ensure normalization between 0 and 1.
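As an illustration of this marginalization, the following sketch (our own; the shapes match the toy example of Figure 2, everything else is an assumption) reshapes a flat affinity matrix into a tensor indexed by (slot, dimension, object, property), projects onto chosen hierarchy levels, and computes the projected completeness and disentanglement.

```python
import numpy as np

def project(R, latent_shape, factor_shape, keep_latent, keep_factor):
    """Marginalize a normalized affinity matrix R onto the kept hierarchy
    levels. latent_shape/factor_shape give the sizes of each level,
    e.g. (n_slots, dims_per_slot) and (n_objects, props_per_object)."""
    T = R.reshape(*latent_shape, *factor_shape)
    nl = len(latent_shape)
    drop = tuple(i for i in range(nl) if i not in keep_latent)
    drop += tuple(nl + j for j in range(len(factor_shape)) if j not in keep_factor)
    P = T.sum(axis=drop)
    n_rows = int(np.prod([latent_shape[i] for i in keep_latent]))
    return P.reshape(n_rows, -1)  # rows: latent groups, cols: factor groups

def projected_metrics(P, eps=1e-12):
    """C(rho) = 1 - H_U(X|Y) and D(rho) = 1 - H_V(Y|X) in projected space."""
    P = P / P.sum()
    U, V = P.shape  # number of latent groups / factor groups
    pY, pX = P.sum(axis=0), P.sum(axis=1, keepdims=True)
    C = 1 + np.sum(P * np.log(P / (pY + eps) + eps)) / np.log(U)
    D = 1 + np.sum(P * np.log(P / (pX + eps) + eps)) / np.log(V)
    return C, D

# Toy example of Figure 2: 2 slots x 2 dims and 2 objects x 2 properties.
R = np.random.rand(4, 4)
slot_object = project(R, (2, 2), (2, 2), keep_latent=(0,), keep_factor=(0,))
C_obj, D_obj = projected_metrics(slot_object)  # object-level metrics
```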
3.3 ESTABLISHING TRUST IN OUR FRAMEWORK: A THEORETICAL ANALYSIS

Our disentanglement $D(\rho)$ and completeness $C(\rho)$ metrics depend on the subset of hierarchy levels contained in the projection $\rho$. We theoretically analyze the influence of the projection choice; it is especially interesting to study the behavior when distinct subsets of hierarchy levels are combined. Our key results are the following: Theorem 1 shows that our framework contains DCI as a special case, when the identity projection is chosen. Theorem 2 shows that the metrics associated with the identity projection can be decomposed in terms of one-dimensional projections along single hierarchy levels. Together, these results formally show that one-dimensional projections provide a sound substitute for prior unstructured metrics, which is very useful to build trust in our framework.

Theorem 1 (Relation with DCI). The DCI metrics of Eastwood and Williams (2018) are a special case of our framework, with the identity projection conserving all hierarchy levels.

Theorem 2 uses the intuitive notion of decomposition of a projection. The projection $\rho$ is said to be decomposed into $\rho_1, \dots, \rho_k$ if the set of hierarchy levels selected by $\rho$ is the disjoint union of the sets of levels selected by $\rho_1, \dots, \rho_k$. In the case of object-centric latent representations, the identity projection $\rho_{id}$ that keeps all hierarchy levels {object, property} can be decomposed into $\rho_{object}$, considering only {object}, and $\rho_{property}$, considering only {property}.

Theorem 2 (Decomposition of a projection). Consider a decomposition of the projection $\rho$ (with $L$ latent groups) into disjoint projections $\rho_1, \dots, \rho_k$ (with respectively $L_1, \dots, L_k$ latent groups). The following lower bound for the joint completeness (resp. disentanglement) holds:
$$1 - \sum_{s=1}^{k} \frac{1 - C(\rho_s)}{\log_{L_s}(L)} \le C(\rho).$$

Suppose that all one-dimensional projections satisfy $C(\rho_s) \ge 1 - \epsilon$, where $\epsilon \ge 0$. We obtain the lower bound $C(\rho_{id}) \ge (1 - k) + k(1 - \epsilon) = 1 - k\epsilon$ for the identity projection. For object-centric representations, this implies that when object-level completeness and property-level completeness are both perfect (that is, $\epsilon = 0$), then DCI completeness is also perfect. The same holds for disentanglement. The supplementary material contains detailed proofs, a matching upper bound, as well as an explicit formula in the special case $k = 2$.

4 PERMUTATION INVARIANT REPRESENTATION PROBING

Representation probing aims at quantifying the information present in each latent dimension, to obtain the relative importance $R_{\hat\tau,\tau}$ of latent $z_{\hat\tau}$ in predicting factor $v_\tau$. Traditional probing methods use either regression feature importance or mutual information to obtain $R$ (see the supplementary). However, these techniques do not account for the permutation invariance of object-centric representations, in which slot reordering leaves the generated image unchanged. To address this, we propose a novel formulation as a permutation invariant feature importance problem. As a core technical contribution of this paper, we present an efficient EM-like algorithm to solve this task in a tractable way.

Traditional representation probing fits a regressor $f$ to predict the factors $v$ from the latents $z$, solving
$$\arg\min_f \sum_j \lVert v^j - f(z^j) \rVert_2^2.$$
Then, the matrix $R$ is extracted from the feature importances of $f$. In contrast, our formulation of permutation invariant representation probing jointly optimizes over permutations $(\pi_1, \dots, \pi_n)$ of groups of latent dimensions (for instance slots), accounting for the slot permutation invariance of the representation, thus solving
$$\arg\min_{(\pi_1, \dots, \pi_n),\, f} \sum_j \lVert v^j - f(\pi_j(z^j)) \rVert_2^2.$$
To address the combinatorial explosion created by the joint optimization, we choose an EM-like approach and iteratively optimize $f$ and the permutations $(\pi_1, \dots, \pi_n)$, each while keeping the other fixed (Algorithm 1). This method yields satisfying approximate solutions provided a good initialization (see Figure 5 and the supplementary).

Algorithm 1 Permutation-invariant representation probing
Input: Latent codes $z^1, \dots, z^n$ and factor values $v^1, \dots, v^n$ for $n$ input images
for i = 1 to n_iters do
    # M STEP
    Fit predictor $f_i$ to features $z^1, \dots, z^n$ and targets $v^1, \dots, v^n$.
    # E STEP
    for j = 1 to n do
        $\pi_{min} \leftarrow \arg\min_\pi \lVert v^j - f_i(\pi(z^j)) \rVert_2^2$   # π ranges over all slot permutations
        $z^j \leftarrow \pi_{min}(z^j)$
Fit $f_{final}$ to features $z^1, \dots, z^n$ and targets $v^1, \dots, v^n$.
Obtain feature importances from $f_{final}$.
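A minimal runnable sketch of Algorithm 1 is given below (our illustration; the paper uses Ridge regression inside the loop and a random forest for the final fit, while the data shapes, the exhaustive enumeration of permutations, and the initial slot ordering are assumptions — the paper initializes via groups of similar inputs, see Appendix C).

```python
import itertools
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

def em_probing(z, v, n_iters=20):
    """z: (n, S, d) slot latents; v: (n, F) factor targets.
    Iteratively reorders the slots of each sample so that a single
    regressor can predict the factors, then fits the final forest."""
    n, S, d = z.shape
    perms = [list(p) for p in itertools.permutations(range(S))]
    z = z.copy()
    for _ in range(n_iters):
        # M step: fit a low-capacity predictor on the current ordering.
        f = Ridge().fit(z.reshape(n, -1), v)
        # E step: per sample, keep the slot permutation with lowest error.
        for j in range(n):
            candidates = np.stack([z[j][p].reshape(-1) for p in perms])
            errors = ((f.predict(candidates) - v[j]) ** 2).sum(axis=1)
            z[j] = z[j][perms[int(errors.argmin())]]
    # Final high-capacity predictor; per-factor importances (one column of
    # R at a time) can be obtained by fitting one forest per factor.
    f_final = RandomForestRegressor(n_estimators=10, max_depth=15)
    f_final.fit(z.reshape(n, -1), v)
    return f_final, z
```

Note that only the latents are permuted; the factor targets stay fixed, so the E step aligns each sample's slots with the canonical object order defined by the targets.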
5 EXPERIMENTAL EVALUATION

Models. We compare three different architectures with public PyTorch implementations: MONet, GENESIS and IODINE, which we believe to be representative of object-centric representations. We evaluate two variants of MONet: the first, denoted "MONet (Att.)", follows the original design of Burgess et al. (2019) and uses the masks inferred by the attention network in the reconstruction. The second, denoted "MONet (Dec.)", is a variant that instead uses the decoded masks. For all models, we perform an ablation of the disentanglement regularization in the training loss. Ablated models are trained with a pure reconstruction loss, denoted (R). All our final feature importances are obtained with random forests.

Datasets. We evaluate all models on CLEVR6 (Johnson et al., 2017) and Multi-dSprites (Matthey et al., 2017; Burgess et al., 2019), with the exception of IODINE, which we restricted to Multi-dSprites for computational reasons, as CLEVR6 requires a week on 8 V100 GPUs per training run. These datasets are the most common benchmarks for learning object-centric latent representations. Both contain compositionally generated images: Multi-dSprites has 2D shapes on a variable-color background, while CLEVR6 is composed of realistically rendered 3D scenes. We slightly modified CLEVR6 to ensure that all objects are visible in the cropped image.

Table 3 contains our experimental results at both object and property level, compared to pixel-level segmentation metrics (ARI and mSC). We highlight some of these numbers in Figure 3. Figure 4 shows the different projections of the affinity matrix R as Hinton diagrams. Figure 5 visualizes the slot permutation learned by our permutation invariant probing algorithm.

Figure 3: a) Ablation of disentanglement regularization tends to decrease disentanglement at the object level (Multi-dSprites). b) Ablation of disentanglement regularization tends to increase pixel-level segmentation (measured with ARI on Multi-dSprites). c) Object-level disentanglement vs. ARI for different models trained on Multi-dSprites; we observe a negative correlation between the two metrics.

Figure 4: Projections of the affinity matrix on Multi-dSprites for IODINE, GENESIS and the two variants of MONet, as Hinton diagrams (rows: objects 1-5; columns: R, G, B, shape, scale, orientation, X, Y). The white area is proportional to the coefficient value.

Figure 5: Effect of EM probing (Algorithm 1) on slot ordering for a group of similar inputs. Each row corresponds to the decomposition of a given input between slots: a) without EM probing, b) with EM probing. The decomposition is compared to a reference decomposition (first row).
We observe that EM probing assigns objects to slots in a way that is always consistent with the reference input; that is, similar objects are matched to similar slots.

Figure 6: Convergence of the EM probing algorithm for the different models (Multi-dSprites). Values cannot be directly compared with Table 2, as the final predictor has higher capacity.

Table 1: Comparison of the permutation obtained with EM probing vs. the one obtained with IoU matching (Multi-dSprites, averaged across models). We report (i) the percentage of exactly matching permutations and (ii) the average mismatch between both permutations.

|                | % exactly matching | Average mismatch |
|----------------|--------------------|------------------|
| Measured       | 62%                | 1.08             |
| Best possible  | 100%               | 0                |
| Worst possible | 0%                 | 5                |
| Chance level   | 0.83%              | 4                |

5.1 RELATION TO PIXEL-LEVEL SEGMENTATION METRICS

Our object-level metrics have the same goal as visual metrics such as ARI and mSC: quantifying object separation between slots. Accordingly, we observe a general correlation of all metrics, with low values at initialization and improvement throughout training. The main difference is that ARI and mSC evaluate separation at the visual level, while our framework operates purely in latent space, measuring the repartition of information (see Figure 1). Consequently, ARI and mSC tend to favor sharp object segmentation masks, while our framework focuses on easy extraction of information from the latent space. This fundamental difference leads to specific regimes of negative correlation. Figure 3 c) shows that ARI and our object-level metric give very different rankings of trained model performance on Multi-dSprites, suggesting that our framework addresses the dependence of ARI on sharp segmentation masks identified by Greff et al. (2019). This is particularly visible for IODINE, which achieves good disentanglement despite its low ARI.

5.2 INFLUENCE OF THE DISENTANGLEMENT REGULARIZATION

Figure 3 a) and b) show an ablation study of the disentanglement regularization. The ablated models are trained with a pure reconstruction loss, without regularization of the latent space. For our disentanglement metrics, we observe a consistent negative impact, in line with observations for unstructured representations (Eastwood and Williams, 2018; Locatello et al., 2019). On the contrary, the ablation tends to improve pixel-level segmentation. This would imply that visual object separation is caused by architectural inductive biases alone, which clarifies the impact of disentanglement regularization in prior work.

5.3 INSIGHTS ABOUT THE DIFFERENT MODELS

Qualitatively, visual inspection of the different projections of the matrix R in Figures 4 and 11 shows a near one-to-one mapping between slots and objects, except for some redundancy in the background slot, which is excluded from our metrics. Our quantitative results in Table 3 confirm a near-perfect object disentanglement of up to 90% for Multi-dSprites. For CLEVR6, we consider the value of 50% to be decent, because the logarithmic scale of the conditional entropy tends to create a strong shift towards 0. To give a simple reference, if each object is equally influenced by 2 slots among 6, the expected value is $1 - \log_6(2) \approx 61\%$. Table 3 additionally shows that (i) the robust performance of GENESIS supports the generalization ability of its separate mask encoding process and GECO optimizer; (ii) IODINE gets close to the performance of MONet, despite a less stable training process that resulted in one outlier with bad performance;
(iii) MONet (Att.) obtains better visual separation metrics, while MONet (Dec.) has better disentanglement. This is because using the decoded masks in MONet (Dec.) increases the pressure to accurately encode the masks in the latent representation. On the contrary, MONet (Att.) does not use the decoded masks for reconstruction, and thus does not need to encode them very accurately.

5.4 EXTENSION TO OTHER HIERARCHY LEVELS

To demonstrate generalization to more than two hierarchy levels, we evaluate disentanglement of intrinsic and extrinsic object properties. This specifically applies to GENESIS, for which each slot is divided into a mask latent and a component latent. Intuitively, intrinsic properties (such as shape and color) are related to the nature of the object, whereas extrinsic properties (such as position and orientation) are contextual to the scene. The projection along this third hierarchy level is given in Table 2, for GENESIS trained on Multi-dSprites. The numbers show that extrinsic properties are successfully captured by the mask latent. However, intrinsic properties are not satisfyingly separated. This is because information about object shape is contained both in the segmentation mask and in the generated image of the object. We believe that recent architectures (Nguyen-Phuoc et al., 2020; Ehrhardt et al., 2020) performing object composition in 3D scene space might solve this problem.

Table 2: Projection along a third hierarchy level (GENESIS, Multi-dSprites).

|                  | Extrinsic | Intrinsic |
|------------------|-----------|-----------|
| Mask latent      | 0.2       | 0.32      |
| Component latent | 0         | 0.48      |

Table 3: Experimental results (stderr over three seeds, values in %). Models with (R) have no disentanglement regularization. ARI/mSC is slightly higher (resp. lower) than prior work for Multi-dSprites (resp. CLEVR6) because of minor differences in the data generation process. At the object level, disentanglement and completeness match closely, but this is not the case at the property level, due to the asymmetry between the number of latent factors and slot dimensions.

Multi-dSprites:

| Model | Obj. Dis. (↑) | Obj. Comp. (↑) | Prop. Dis. (↑) | Prop. Comp. (↑) | Inf. (↓) | ARI (↑) | mSC (↑) |
|---|---|---|---|---|---|---|---|
| MONet (Att.) | 52 ± 4 | 52 ± 4 | 39 ± 6 | 60 ± 5 | 45 ± 3 | 95 ± 1 | 83 ± 0.3 |
| MONet (Att.) (R) | 60 ± 4 | 60 ± 4 | 29 ± 3 | 52 ± 1 | 36 ± 2 | 98 ± 0.1 | 86 ± 4 |
| MONet (Dec.) | 75 ± 7 | 75 ± 7 | 59 ± 6 | 74 ± 5 | 26 ± 6 | 80 ± 4 | 68 ± 2 |
| MONet (Dec.) (R) | 69 ± 0.3 | 69 ± 0.2 | 35 ± 1 | 55 ± 1 | 32 ± 0.8 | 91 ± 0.2 | 69 ± 2 |
| GENESIS | 90 ± 0.8 | 91 ± 0.6 | 72 ± 0.5 | 60 ± 0.4 | 24 ± 0.4 | 85 ± 0 | 70 ± 0.3 |
| GENESIS (R) | 64 ± 3 | 65 ± 3 | 65 ± 1 | 42 ± 1 | 30 ± 0.5 | 96 ± 0.6 | 84 ± 0.7 |
| IODINE | 70 ± 9 | 72 ± 10 | 37 ± 5 | 62 ± 4 | 37 ± 6 | 80 ± 8 | 71 ± 3 |
| IODINE (R) | 35 ± 2 | 36 ± 1 | 17 ± 2 | 46 ± 1 | 61 ± 2 | 75 ± 4 | 65 ± 1 |

CLEVR6:

| Model | Obj. Dis. (↑) | Obj. Comp. (↑) | Prop. Dis. (↑) | Prop. Comp. (↑) | Inf. (↓) | ARI (↑) | mSC (↑) |
|---|---|---|---|---|---|---|---|
| MONet (Att.) | 46 ± 5 | 48 ± 5 | 22 ± 4 | 51 ± 3 | 55 ± 5 | 93 ± 0.4 | 68 ± 6 |
| MONet (Att.) (R) | 40 ± 9 | 41 ± 9 | 13 ± 5 | 44 ± 3 | 61 ± 8 | 93 ± 0.4 | 65 ± 5 |
| MONet (Dec.) | 47 ± 6 | 48 ± 6 | 21 ± 6 | 50 ± 4 | 55 ± 5 | 90 ± 2 | 65 ± 2 |
| MONet (Dec.) (R) | 40 ± 1 | 41 ± 0.2 | 13 ± 0.6 | 44 ± 0.1 | 60 ± 0.6 | 92 ± 0.5 | 69 ± 1 |
| GENESIS | 50 ± 2 | 52 ± 2 | 47 ± 2 | 39 ± 2 | 47 ± 0.9 | 91 ± 0.4 | 65 ± 3 |
| GENESIS (R) | 43 ± 1 | 45 ± 2 | 24 ± 2 | 21 ± 2 | 65 ± 1 | 92 ± 0.2 | 60 ± 3 |

5.5 RE-EXAMINING THE MODEL SELECTION PROCESS

These divergences between visual metrics and our framework lead us to reconsider prior model selection processes, which were primarily centered on ARI. To give three striking examples: (i) the recent Slot Attention (Locatello et al., 2020) architecture does not use disentanglement regularization; in light of our ablation study (Section 5.2), we believe that this is potentially harmful for disentanglement. (ii) Burgess et al. (2019) and Engelcke et al. (2019) chose to privilege MONet (Att.), which obtains lower disentanglement than MONet (Dec.)
(Section 5.3). (iii) Most prior work chose a 2D segmentation mask approach that does not satisfyingly disentangle intrinsic and extrinsic object properties (Section 5.4). We believe that all related existing experimental studies would benefit from the perspective offered by our framework.

6 RELATED WORK

Structured disentanglement metrics. Most similar to our work is the slot compactness metric of Racah and Chandar (2020), also based on aggregation of feature importances to measure object separation between slots. However, it only operates at one hierarchy level and does not handle slot permutation invariance, which is essential to obtain meaningful results. Also related is the hierarchical disentanglement benchmark of Ross and Doshi-Velez (2021), whose ability to learn the hierarchy levels in the representation is remarkable, but whose applicability is unfortunately limited to toy datasets. Finally, Esmaeili et al. (2019) present a structured variational loss encouraging disentanglement at the group level. There is a high-level connection with our work, but the objective is ultimately different: Esmaeili et al. (2019) evaluate their structured variational loss with unstructured disentanglement metrics.

7 CONCLUSION

Our framework for evaluating disentanglement of structured latent representations addresses a number of issues with prior visual segmentation metrics. We hope that it will be helpful for the validation and selection of future object-centric representations. Besides, we took great care not to make any domain-specific assumption, and believe that the principles discussed here apply to any kind of structured generative modeling.

ACKNOWLEDGEMENTS

This work was granted access to the HPC resources of IDRIS under the allocation 2020-AD011012138 made by GENCI. We would like to thank Frederik Benzing, Kalina Petrova, Asier Mujika, and Wouter Tonnon for helpful discussions. This work constitutes the public version of Raphaël Dang-Nhu's Master's Thesis at ETH Zürich.

REFERENCES

Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. (2010). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898-916.
Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828.
Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., and Lerchner, A. (2019). MONet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872.
Chen, R. T., Li, X., Grosse, R. B., and Duvenaud, D. K. (2018). Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610-2620.
Cover, T. M. (1999). Elements of Information Theory. John Wiley & Sons.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22.
Eastwood, C. and Williams, C. K. (2018). A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations.
Ehrhardt, S., Groth, O., Monszpart, A., Engelcke, M., Posner, I., Mitra, N., and Vedaldi, A. (2020). RELATE: Physically plausible multi-object scene synthesis using structured latent spaces.
arXiv preprint arXiv:2007.01272.
Engelcke, M., Kosiorek, A. R., Jones, O. P., and Posner, I. (2019). GENESIS: Generative scene inference and sampling with object-centric latent representations. arXiv preprint arXiv:1907.13052.
Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J., and van de Meent, J.-W. (2019). Structured disentangled representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2525-2534. PMLR.
Greff, K., Kaufman, R. L., Kabra, R., Watters, N., Burgess, C., Zoran, D., Matthey, L., Botvinick, M., and Lerchner, A. (2019). Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450.
Greff, K., Rasmus, A., Berglund, M., Hao, T., Valpola, H., and Schmidhuber, J. (2016). Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, pages 4484-4492.
Greff, K., Van Steenkiste, S., and Schmidhuber, J. (2017). Neural expectation maximization. In Advances in Neural Information Processing Systems, pages 6691-6701.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2016). beta-VAE: Learning basic visual concepts with a constrained variational framework.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901-2910.
Kim, H. and Mnih, A. (2018). Disentangling by factorising. arXiv preprint arXiv:1802.05983.
Kumar, A., Sattigeri, P., and Balakrishnan, A. (2017). Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848.
Li, N., Fisher, R., et al. (2020). Learning object-centric representations of multi-object scenes from multiple views. Advances in Neural Information Processing Systems, 33.
Lin, Z., Wu, Y.-F., Peri, S. V., Sun, W., Singh, G., Deng, F., Jiang, J., and Ahn, S. (2020). SPACE: Unsupervised object-oriented scene representation via spatial attention and decomposition. arXiv preprint arXiv:2001.02407.
Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114-4124.
Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and Kipf, T. (2020). Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33.
Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. (2017). dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/ [Accessed: 2018-05-08].
Mena, G., Belanger, D., Linderman, S., and Snoek, J. (2018). Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665.
Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., and Yang, Y.-L. (2019). HoloGAN: Unsupervised learning of 3D representations from natural images. In Proceedings of the IEEE International Conference on Computer Vision, pages 7588-7597.
Nguyen-Phuoc, T., Richardt, C., Mai, L., Yang, Y.-L., and Mitra, N. (2020). BlockGAN: Learning 3D object-aware scene representations from unlabelled images. arXiv preprint arXiv:2002.08988.
Niemeyer, M. and Geiger, A.
(2020). GIRAFFE: Representing scenes as compositional generative neural feature fields. arXiv preprint arXiv:2011.12100.
Racah, E. and Chandar, S. (2020). Slot contrastive networks: A contrastive approach for representing objects. arXiv preprint arXiv:2007.09294.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846-850.
Rezende, D. J. and Viola, F. (2018). Generalized ELBO with constrained optimization, GECO. In Workshop on Bayesian Deep Learning, NeurIPS.
Ridgeway, K. and Mozer, M. C. (2018). Learning deep disentangled embeddings with the F-statistic loss. In Advances in Neural Information Processing Systems, pages 185-194.
Ross, A. S. and Doshi-Velez, F. (2021). Benchmarks, algorithms, and metrics for hierarchical disentanglement. arXiv preprint arXiv:2102.05185.
Spelke, E. S., Breinlinger, K., Macomber, J., and Jacobson, K. (1992). Origins of knowledge. Psychological Review, 99(4):605.
Van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. (2018). Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353.
van Steenkiste, S., Kurach, K., and Gelly, S. (2018). A case for object compositionality in deep generative models of images. arXiv preprint arXiv:1810.10340.
Wang, T., Liao, R., Ba, J., and Fidler, S. (2018). NerveNet: Learning structured policy with graph neural networks. In International Conference on Learning Representations.
Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., and Tenenbaum, J. B. (2019). CLEVRER: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442.
Yu, D., Kolbæk, M., Tan, Z.-H., and Jensen, J. (2017). Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 241-245. IEEE.
Zaidi, J., Boilard, J., Gagnon, G., and Carbonneau, M.-A. (2020). Measuring disentanglement: A review of metrics. arXiv preprint arXiv:2012.09276.

A ADDITIONAL RELATED WORK

A.1 OBJECT-CENTRIC REPRESENTATION LEARNING

Many models have been developed to perform unsupervised perceptual grouping and learn compositional representations. A first line of work (TAGGER (Greff et al., 2016), NEM (Greff et al., 2017), R-NEM (Van Steenkiste et al., 2018)) has focused on adapting traditional Expectation Maximization (EM) (Dempster et al., 1977) methods for data clustering to a differentiable neural setting. Nonetheless, despite strong theoretical foundations, these models do not scale to more complex multi-object datasets such as Multi-dSprites (Burgess et al., 2019) and CLEVR (Johnson et al., 2017). More recent efforts (MONet (Burgess et al., 2019), IODINE (Greff et al., 2019), GENESIS (Engelcke et al., 2019), Slot Attention (Locatello et al., 2020)) have focused on application to these datasets. These architectures differ in how exactly they segment the image between the different slots of the representation. MONet is based on a recurrent attention network that repeatedly separates parts of the image until it is totally decomposed. GENESIS was designed to address a key limitation of MONet, namely that it does not learn a latent representation for the segmentation, which prevents principled sampling of novel scenes. With GENESIS, segmentation masks are separately encoded, and an autoregressive prior on the mask latents is enforced.
IODINE uses a different strategy, building upon the framework of iterative amortized inference, where an initial arbitrary guess for the posterior is progressively refined. However, this iterative process is very computationally expensive. Slot Attention, as its name indicates, introduces an attention-based iterative encoder that is much more computationally efficient than IODINE. The SPACE model (Lin et al., 2020) combines the strengths of spatial attention and scene mixture models to further improve applicability. Finally, MulMON (Li et al., 2020) extends object-centric representations to a setting with multi-view supervision.

In a parallel line of work, object-compositional generation is believed to be a promising inductive bias for GAN-like models (van Steenkiste et al., 2018). Obtaining good representations and producing realistic samples are connected challenges, which both require developing suitable decoder architectures. Very recent work has focused on integrating object compositionality into the GAN framework (Nguyen-Phuoc et al., 2020; Ehrhardt et al., 2020; Niemeyer and Geiger, 2020). The question of disentanglement is also present in this context (Nguyen-Phuoc et al., 2019), but it remains a secondary objective.

A.2 TRADITIONAL DISENTANGLEMENT METRICS

Several metrics (Zaidi et al., 2020) have been developed to evaluate the quality of learned representations and allow comparison between models. We identify two main categories.

Classifier-based metrics. The first group of metrics is based on fixing the value of a factor of variation and generating several samples sharing this value. Intuitively, if the factors of variation are satisfyingly disentangled in the representation, it should be possible to predict which factor was fixed from the different latents. Inside this category, the metrics differ in how exactly they identify the factor. BetaVAE (Higgins et al., 2016) uses a linear classifier to predict the factor index, while FactorVAE uses a majority vote and takes the empirical variances as input, with the goal of addressing several robustness issues. Despite these implementation differences, Locatello et al. (2019) have empirically observed that both metrics are strongly correlated across a wide range of tasks and models.

Affinity-based metrics. The second category of metrics is based on measuring the affinity between each factor of variation and each latent variable, and evaluating how close these affinities are to a one-to-one mapping between factors and variables. Different ways of quantifying affinity have been proposed: the Mutual Information Gap (Chen et al., 2018) and Modularity (Ridgeway and Mozer, 2018) measure mutual information between factors and variables. The SAP score (Kumar et al., 2017) leverages the linear coefficient of determination R² obtained when regressing the factor from the latent. Finally, the DCI metric is based on measuring regression feature importance, using either Lasso or random forests. These metrics also differ in how exactly they assess the separation of factors in the latent representation. The Mutual Information Gap and the SAP score measure the gap between the two most relevant variables for a factor. Modularity alternatively quantifies the distance to an ideal affinity template. Finally, the DCI metric uses the entropy of the normalized feature importances as an indicator of entanglement.

In developing our metric, we have chosen to privilege the affinity-based approach for two reasons.
First, it does not require the ability to fix one factor while generating samples, contrary to BetaVAE and FactorVAE. Second, the notion of affinity generalizes in a very flexible way to groups of factors and latent variables: it is sufficient to sum the scores of all pairs of factors and variables in the groups. We use regression feature importance as a way of measuring affinity. Moreover, we privilege the entropy measure of separation over affinity-gap methods: we believe that entropy captures more information about the repartition of information, as it takes into account all coefficients rather than just the top two.

A.3 PERMUTATION INVARIANT LEARNING

Developing algorithms that preserve permutation invariance is no doubt a major challenge for the machine learning community: similar problematics appear in a variety of applications, ranging from speaker separation in the cocktail-party problem (Yu et al., 2017) to object detection losses (Carion et al., 2020). Compared to the representation probing of IODINE, the Hungarian matching in Slot Attention, and the IoU matching of MulMON (Li et al., 2020), the fundamental novelty of Algorithm 1 is that it obtains the optimal alignment directly at the latent code level (that is, feature level), which is agnostic to the input type. In contrast, the alignment in IODINE and MulMON operates at the image level using the segmentation mask, and the Hungarian matching in Slot Attention aligns factor predictions and labels. Besides, the matching approaches used in IODINE, Slot Attention and MulMON only identify the most important slot for each target, whereas the DCI framework requires an importance score for each slot/object pair. For future work, it would be interesting to compare EM probing with Sinkhorn-based approaches for learning latent permutations (e.g., Mena et al. (2018)).

B PROOFS AND ADDITIONAL THEORETICAL RESULTS

B.1 BACKGROUND IN INFORMATION THEORY

The description of our metric leverages concepts originating from the field of information theory (Cover, 1999). In this section, we recall some central definitions and results that we will use in the rest of the manuscript. In all of the following, $X$, $Y$ and $Z$ denote three discrete random variables defined on a common probability space. We denote $x_1, \dots, x_l$ (resp. $y_1, \dots, y_m$ and $z_1, \dots, z_n$) the potential outcomes of $X$ (resp. $Y$ and $Z$). $K$ is a positive real number. For brevity, we write $P[x_i]$ for $P[X = x_i]$, and similarly for $Y$ and $Z$ and the joint random variables.

Definition 1. The entropy of $X$ in base $K$ quantifies the amount of uncertainty in the potential outcomes of $X$. It is defined as
$$H_K(X) = -\mathbb{E}[\log_K(P(X))] = -\sum_{i=1}^{l} P[x_i] \log_K(P[x_i]).$$

Definition 2. The conditional entropy of $Y$ given $X$ in base $K$ quantifies the amount of information needed to describe the outcome of $Y$ given that the value of $X$ is known. It is defined as
$$H_K(Y \mid X) = -\sum_{i,j} P[x_i, y_j] \log_K(P[y_j \mid x_i]).$$

Definition 3. The mutual information of $X$ and $Y$ in base $K$ is a measure of the amount of information obtained about one of the two variables by observing the other. It is defined as
$$I_K(X; Y) = \sum_{i,j} P[x_i, y_j] \log_K \frac{P[x_i, y_j]}{P[x_i] P[y_j]}.$$

Definition 4. The conditional mutual information $I_K(X; Y \mid Z)$ is, in base $K$, the expected value of the mutual information of $X$ and $Y$ conditioned on the value of $Z$. Formally,
$$I_K(X; Y \mid Z) = \sum_{i,j,k} P[x_i, y_j, z_k] \log_K \frac{P[x_i, y_j \mid z_k]}{P[x_i \mid z_k] P[y_j \mid z_k]}.$$
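For readers who prefer executable definitions, the base-K quantities above can be computed directly from a joint probability table; a minimal NumPy sketch (our own illustration):

```python
import numpy as np

EPS = 1e-12

def entropy(p, K):
    """H_K(X) for a probability vector p."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + EPS)) / np.log(K)

def conditional_entropy(joint, K):
    """H_K(Y|X) from a joint table P[x_i, y_j] (rows: x, columns: y)."""
    px = joint.sum(axis=1, keepdims=True)
    cond = joint / (px + EPS)  # P[y_j | x_i]
    return -np.sum(joint * np.log(cond + EPS)) / np.log(K)

def mutual_information(joint, K):
    """I_K(X; Y) = H_K(Y) - H_K(Y|X)."""
    return entropy(joint.sum(axis=0), K) - conditional_entropy(joint, K)
```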
The following lemmas are standard results of information theory. As these are well known, we do not provide proofs here and refer to textbooks such as Cover (1999).

Lemma 1 (Change of base). Let $K_1$ be a positive real number. The change of base formula for entropy is
$$H_{K_1}(X) = \frac{H_K(X)}{\log_K(K_1)}.$$
Similarly, we have for conditional entropy that
$$H_{K_1}(Y \mid X) = \frac{H_K(Y \mid X)}{\log_K(K_1)}.$$

Lemma 2 (Subadditivity of entropy). Let $X_1, \dots, X_n$ be $n$ arbitrary discrete random variables. We have
$$H_K(X_1, \dots, X_n) \le \sum_{i=1}^{n} H_K(X_i).$$
The same holds for conditional entropy, i.e.
$$H_K(X_1, \dots, X_n \mid Y) \le \sum_{i=1}^{n} H_K(X_i \mid Y).$$

Lemma 3 (Nonnegativity of mutual information). The following holds:
$$H_K(X) - H_K(X \mid Y) = I_K(X; Y) \ge 0.$$
Conditioned on a third variable, this generalizes to
$$H_K(X \mid Z) - H_K(X \mid Y, Z) = I_K(X; Y \mid Z) \ge 0.$$

Lemma 4 (Relation of joint entropy to individual entropies). Let $X_1, \dots, X_n$ be $n$ arbitrary discrete random variables. We have
$$H_K(X_1, \dots, X_n) \ge \max_{i=1}^{n} H_K(X_i).$$
The same holds for conditional entropy, i.e.
$$H_K(X_1, \dots, X_n \mid Y) \ge \max_{i=1}^{n} H_K(X_i \mid Y).$$

Lemma 5 (Joint conditional entropy). The following holds:
$$H_K(X, Y \mid Z) = H_K(X \mid Z) + H_K(Y \mid Z) - I_K(X; Y \mid Z).$$

B.2 FORMALIZATION OF THE STRUCTURED SETTING

The specificity of our setting compared to the DCI metric is that the structured organization of latent variables and factors of variation cannot be accurately described by scalar indexes. We propose to account for this structured organization via tuple indexes, in which each element of the tuple is responsible for a hierarchy level. In general, the structure over factors of variation expresses the goal of the metric (e.g., measuring object separation), while the structure over latent dimensions depends on the model architecture. Formally, both structures can be defined as relations. Consider $n$ attributes $A_1, \dots, A_n$ representing levels of hierarchy. Each attribute is associated with a set of possible values (a domain) for factors, $\mathrm{dom}_F(A_i)$, and for latents, $\mathrm{dom}_L(A_i)$. These domains constrain the possible values of the tuple indexes, such that the set of tuples $\mathcal{T}$ describing the factors of variation is a subset of $\prod_{i=1}^{n} \mathrm{dom}_F(A_i)$, and the set of tuples $\hat{\mathcal{T}}$ describing the latent dimensions is a subset of $\prod_{i=1}^{n} \mathrm{dom}_L(A_i)$. Thus, the factor values for sample $i$ can be denoted $(v^i_\tau)_{\tau \in \mathcal{T}}$, and the representation $z^i$ can be written $z^i = (z^i_{\hat\tau})_{\hat\tau \in \hat{\mathcal{T}}}$. We also denote $|\mathcal{T}| = F$ and $|\hat{\mathcal{T}}| = L$.

Toy example. Consider a data generating process with two objects having two properties each. The first attribute $A_1$ is responsible for the object hierarchy level, while $A_2$ is responsible for the property level. The structure over factors is formally defined as
$$\mathcal{T} = \{(\text{object 1}, \text{color}), (\text{object 1}, \text{size}), (\text{object 2}, \text{color}), (\text{object 2}, \text{size})\}.$$
Supposing that there are 2 slots with two dimensions each, the structure over latent dimensions is defined as
$$\hat{\mathcal{T}} = \{(\text{slot 1}, \text{dim 1}), (\text{slot 1}, \text{dim 2}), (\text{slot 2}, \text{dim 1}), (\text{slot 2}, \text{dim 2})\}.$$
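The tuple-index formalization maps directly onto code; a small sketch (our own, with names chosen to match the toy example above) represents $\mathcal{T}$ and $\hat{\mathcal{T}}$ explicitly and applies a projection $\rho$ that keeps only the first hierarchy level:

```python
from collections import defaultdict

# Tuple index sets of the toy example (attribute order: A1 = object/slot
# level, A2 = property/dimension level).
T_hat = [("slot 1", "dim 1"), ("slot 1", "dim 2"),
         ("slot 2", "dim 1"), ("slot 2", "dim 2")]
T = [("object 1", "color"), ("object 1", "size"),
     ("object 2", "color"), ("object 2", "size")]

def rho(tup, keep=(0,)):
    """Projection selecting the subset `keep` of hierarchy levels."""
    return tuple(tup[e] for e in keep)

def project_affinity(R, keep=(0,)):
    """Sum the affinities R[(tau_hat, tau)] of indexes that project together."""
    grouped = defaultdict(float)
    for (t_hat, t), value in R.items():
        grouped[(rho(t_hat, keep), rho(t, keep))] += value
    return dict(grouped)

# Uniform affinities as a placeholder; in practice R comes from probing.
R = {(t_hat, t): 1.0 / 16 for t_hat in T_hat for t in T}
slot_object = project_affinity(R, keep=(0,))  # 2x2 groups, as in Figure 2
```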
B.3 PROOFS OF MAIN PAPER

Theorem 1. With the total projection $i = \{1, \dots, n\}$, our metric captures the disentanglement and completeness metrics of the DCI framework (Eastwood and Williams, 2018).

Proof. We only present the proof for the completeness metric, as the situation for the disentanglement metric is exactly symmetric upon switching $X$ and $Y$ and rows and columns. With the total projection $i = \{1, \dots, n\}$, we have $\rho_i(Y) = Y$, $\rho_i(X) = X$, $U = |\rho_i(\hat{\mathcal{T}})| = |\hat{\mathcal{T}}| = L$ and $V = |\rho_i(\mathcal{T})| = |\mathcal{T}| = F$. Therefore, completeness with respect to the total projection is defined as
$$C(i) = 1 - H_L(X \mid Y).$$
According to Definition 2, we have
$$H_L(X \mid Y) = -\sum_{\tau \in \mathcal{T}, \hat\tau \in \hat{\mathcal{T}}} P[X = \hat\tau, Y = \tau] \log_L(P[X = \hat\tau \mid Y = \tau]) = -\sum_{\tau \in \mathcal{T}} P[Y = \tau] \sum_{\hat\tau \in \hat{\mathcal{T}}} P[\hat\tau \mid \tau] \log_L(P[\hat\tau \mid \tau]).$$
Now, let us observe that the conditional probability $P[\hat\tau \mid \tau]$ is exactly the term $P_{\hat\tau,\tau}$ of the DCI framework that denotes the "probability" of latent $\hat\tau$ being important to predict factor $\tau$. Consequently, the conditional entropy can be rewritten as follows:
$$H_L(X \mid Y) = -\sum_{\tau \in \mathcal{T}} P[Y = \tau] \sum_{\hat\tau \in \hat{\mathcal{T}}} P_{\hat\tau,\tau} \log_L P_{\hat\tau,\tau} = \sum_{\tau \in \mathcal{T}} P[Y = \tau] H_L(P_{\cdot,\tau}).$$
This yields
$$C(i) = 1 - H_L(X \mid Y) = \sum_{\tau \in \mathcal{T}} P[Y = \tau] \big(1 - H_L(P_{\cdot,\tau})\big) = \sum_{\tau \in \mathcal{T}} P[Y = \tau] \, C_\tau.$$
We observe that $C_\tau = 1 - H_L(P_{\cdot,\tau})$ is exactly the completeness score in capturing factor $\tau$ defined in the DCI framework. Besides, $P[Y = \tau]$ can be rewritten as
$$P[Y = \tau] = \sum_{\hat\tau \in \hat{\mathcal{T}}} P[Y = \tau, X = \hat\tau] = \frac{\sum_{\hat\tau \in \hat{\mathcal{T}}} R_{\hat\tau,\tau}}{\sum_{i \in \hat{\mathcal{T}}, j \in \mathcal{T}} R_{i,j}},$$
which is exactly the relative generative factor importance used by Eastwood and Williams to construct a weighted average expressing overall completeness. This indicates that our probabilistic view of the affinity matrix and metric naturally captures all components of the DCI framework, including the final weighted average step. ∎

For the next theorem, we formally define the meaning of a decomposition of projections in Definition 5. Note that the formalism is slightly different, as our original notations were simplified for the main paper; despite the difference in notation, union and decomposition of projections are totally equivalent.

Definition 5 (Union of projections). Consider $k$ disjoint projections $i_1 = \{e^1_1, \dots, e^1_{l_1}\}, \dots, i_k = \{e^k_1, \dots, e^k_{l_k}\}$ of the generative model. The union of these projections is defined as $i = \bigcup_{s=1}^{k} i_s$. It is a projection of size $l_1 + \dots + l_k$.

Theorem 2 (Lower bound). Consider $k$ disjoint projections $i_1, \dots, i_k$ of the relations. Suppose that $i_1, \dots, i_k$ have respectively $L_1, \dots, L_k$ groups of latents and $F_1, \dots, F_k$ groups of factors. Moreover, assume that the joint projection $i = \bigcup_{s=1}^{k} i_s$ has $L$ groups of latents and $F$ groups of factors. The following lower bound for the joint completeness holds:
$$1 - \sum_{s=1}^{k} \frac{1 - C(i_s)}{\log_{L_s}(L)} \le C(i).$$
Similarly, we have for the disentanglement metric
$$1 - \sum_{s=1}^{k} \frac{1 - D(i_s)}{\log_{F_s}(F)} \le D(i).$$

Proof. We only detail the proof for the completeness metric, since both cases are exactly symmetric. $C(i)$ is defined as
$$C(i) = 1 - H_L(\rho_i(X) \mid \rho_i(Y)).$$
Now, let us observe that the joint projection $\rho_i$ and the concatenation $\prod_{s=1}^{k} \rho_{i_s}$ of the different projections are identical up to the permutation of dimensions that originates from sorting the merged index sequences. This permutation has no influence on the conditional entropy, since the latter is defined as a joint expectation on $\hat{\mathcal{T}} \times \mathcal{T}$. Therefore,
$$C(i) = 1 - H_L\Big(\prod_{s=1}^{k} \rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big).$$
According to Lemma 2, we have the following inequality for the joint conditional entropy:
$$H_L\Big(\prod_{s=1}^{k} \rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big) \le \sum_{s=1}^{k} H_L\Big(\rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big). \quad (1)$$
According to Lemma 3, we also know that
$$H_L\Big(\rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big) \le H_L(\rho_{i_s}(X) \mid \rho_{i_s}(Y)).$$
Applying Lemma 1 to change base, we obtain
$$H_L(\rho_{i_s}(X) \mid \rho_{i_s}(Y)) = \frac{H_{L_s}(\rho_{i_s}(X) \mid \rho_{i_s}(Y))}{\log_{L_s}(L)} = \frac{1 - C(i_s)}{\log_{L_s}(L)}.$$
Together with Equation (1), this yields
$$C(i) \ge 1 - \sum_{s=1}^{k} \frac{1 - C(i_s)}{\log_{L_s}(L)}.$$
Since $\log_{L_s}(L) \ge 1$ and $1 - C(i_s) \ge 0$, we finally get
$$C(i) \ge 1 - \sum_{s=1}^{k} \big(1 - C(i_s)\big),$$
which concludes the proof. ∎

B.4 ADDITIONAL RESULTS

Theorem 3 describes an upper bound for the joint metric, attempting to match Theorem 2. However, this upper bound comes in a weaker form which is not totally controllable. Intuitively, this is due to the fact that our metrics convey strictly more information than the unstructured DCI framework.
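Before stating the upper bound, we note that the lower bound of Theorem 2 can be checked numerically on random affinity matrices; a small sketch (our own, on the two-level toy structure, with $L = 4$ and $L_1 = L_2 = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)

def completeness(P, base):
    """C = 1 - H_base(X|Y) for a joint table P (rows: latents, cols: factors)."""
    eps = 1e-12
    pY = P.sum(axis=0)
    H = -np.sum(P * np.log(P / (pY + eps) + eps)) / np.log(base)
    return 1 - H

# Random normalized affinity over (2 slots x 2 dims) x (2 objects x 2 props).
R = rng.random((4, 4)); R /= R.sum()
T = R.reshape(2, 2, 2, 2)  # axes: (slot, dim, object, prop)

C_joint = completeness(R, base=4)                 # identity projection
C_obj = completeness(T.sum(axis=(1, 3)), base=2)  # object-level projection
C_prop = completeness(T.sum(axis=(0, 2)), base=2) # property-level projection

# Theorem 2 lower bound: here log_{L_s}(L) = log_2(4) = 2 for both terms.
lower = 1 - (1 - C_obj) / 2 - (1 - C_prop) / 2
assert lower <= C_joint + 1e-9
```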
Theorem 3 (Upper bound). Consider $k$ disjoint projections $i_1, \dots, i_k$ of the relations. Suppose that $i_1, \dots, i_k$ have respectively $L_1, \dots, L_k$ groups of latents and $F_1, \dots, F_k$ groups of factors. Moreover, assume that the joint projection $i = \bigcup_{s=1}^{k} i_s$ has $L$ groups of latents and $F$ groups of factors. The following upper bound for the completeness metric holds:
$$C(i) \le 1 - \max_{1 \le s \le k} \left[ \frac{1 - C(i_s)}{\log_{L_s}(L)} - A_s \right], \quad \text{where } A_s = I_L\Big(\rho_{i_s}(X); \rho_{\bigcup_{t \ne s} i_t}(Y) \,\Big|\, \rho_{i_s}(Y)\Big).$$
Similarly, we have for the disentanglement metric that
$$D(i) \le 1 - \max_{1 \le s \le k} \left[ \frac{1 - D(i_s)}{\log_{F_s}(F)} - B_s \right], \quad \text{where } B_s = I_F\Big(\rho_{i_s}(Y); \rho_{\bigcup_{t \ne s} i_t}(X) \,\Big|\, \rho_{i_s}(X)\Big).$$

Proof. We only prove the bound for the completeness metric, as the case of disentanglement is exactly symmetric. $C(i)$ is defined as
$$C(i) = 1 - H_L(\rho_i(X) \mid \rho_i(Y)).$$
Similarly to the previous proof, we observe that the joint projection $\rho_i$ and the concatenation $\prod_{s=1}^{k} \rho_{i_s}$ of the different projections are identical up to the permutation of dimensions that originates from sorting the merged index sequences. This permutation has no influence on the conditional entropy, since the latter is defined as a joint expectation on $\hat{\mathcal{T}} \times \mathcal{T}$. Therefore,
$$C(i) = 1 - H_L\Big(\prod_{s=1}^{k} \rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big).$$
According to Lemma 4, we have the following inequality for the joint conditional entropy:
$$H_L\Big(\prod_{s=1}^{k} \rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big) \ge \max_{1 \le s \le k} H_L\Big(\rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big). \quad (2)$$
According to Lemma 3, we also know that
$$H_L\Big(\rho_{i_s}(X) \,\Big|\, \prod_{t=1}^{k} \rho_{i_t}(Y)\Big) = H_L(\rho_{i_s}(X) \mid \rho_{i_s}(Y)) - A_s.$$
Applying Lemma 1 to change base, we obtain
$$H_L(\rho_{i_s}(X) \mid \rho_{i_s}(Y)) = \frac{1 - C(i_s)}{\log_{L_s}(L)}.$$
Together with Equation (2), this yields
$$C(i) \le 1 - \max_{1 \le s \le k} \left[ \frac{1 - C(i_s)}{\log_{L_s}(L)} - A_s \right]. \qquad \blacksquare$$

Interpretation of the upper bound. The goal of Theorem 3 is to use the individual projections to obtain an upper bound for the joint completeness (resp. disentanglement). Intuitively, the meaning of the bound is that the joint completeness cannot be better than any of the individual completeness scores. However, this bound is weaker than Theorem 2 because of two main restrictions. First, it is harder to get rid of the $\log_{L_s}(L)$ term, as $L$ is greater than $L_s$. One possibility is to notice that $L \le L_1 \cdots L_k$. If we make the additional assumption that $L_1 \approx \dots \approx L_k$, then we obtain $\log_{L_s}(L) \le k$, which gives us
$$C(i) \le \frac{k - 1 + \min_{1 \le s \le k} \big( C(i_s) + k A_s \big)}{k}.$$
This first assumption can be considered reasonable. But more importantly, we notice an additional interaction term $A_s$ in the minimum that is based on mutual information between the different projections. This term cannot simply be removed, as it is non-negative. The intuition for this conditional mutual information is that it measures the dependence of the different projections: it is 0 exactly when $\rho_{i_s}(X)$ and $\rho_{\bigcup_{t \ne s} i_t}(Y)$ are independent conditioned on $\rho_{i_s}(Y)$. Assuming that this is the case, we obtain the simplest inequality
$$C(i) \le \frac{k - 1 + \min_{1 \le s \le k} C(i_s)}{k}.$$
However, there is absolutely no guarantee that this assumption is valid. To conclude, the upper bound is weaker than Theorem 2 because it depends on the additional terms $A_s$ and $B_s$, which account for the degree of dependence between projections and cannot easily be controlled. This can be seen as a confirmation that our metrics convey strictly more information than the unstructured DCI framework.

In the following, we study the specific case where there are only two levels of hierarchy ($k = 2$). In this case, we can actually derive an exact formula rather than two matching bounds.

Theorem 4 (Case k = 2). Consider 2 disjoint projections $i_1$ and $i_2$, with respectively $L_1$ and $L_2$ groups of latents, and $F_1$, $F_2$ groups of factors. Assume that the joint projection $i = i_1 \cup i_2$ has $L$ groups of latents and $F$ groups of factors.
For the completeness metric, the following identity holds:
$$C(i) = 1 - \frac{1 - C(i_1)}{\log_{L_1}(L)} - \frac{1 - C(i_2)}{\log_{L_2}(L)} + I_L(\rho_{i_1}(X); \rho_{i_2}(Y) \mid \rho_{i_1}(Y)) + I_L(\rho_{i_2}(X); \rho_{i_1}(Y) \mid \rho_{i_2}(Y)) + I_L(\rho_{i_1}(X); \rho_{i_2}(X) \mid \rho_{i_1}(Y), \rho_{i_2}(Y)).$$
In a symmetric way, for the disentanglement metric,
$$D(i) = 1 - \frac{1 - D(i_1)}{\log_{F_1}(F)} - \frac{1 - D(i_2)}{\log_{F_2}(F)} + I_F(\rho_{i_1}(Y); \rho_{i_2}(X) \mid \rho_{i_1}(X)) + I_F(\rho_{i_2}(Y); \rho_{i_1}(X) \mid \rho_{i_2}(X)) + I_F(\rho_{i_1}(Y); \rho_{i_2}(Y) \mid \rho_{i_1}(X), \rho_{i_2}(X)).$$

Proof. Again, we only prove the result for the completeness metric, as disentanglement is exactly symmetric. In the case where $k = 2$, the joint completeness is defined as follows:
$$C(i) = 1 - H_L(\rho_i(X) \mid \rho_i(Y)) = 1 - H_L(\rho_{i_1}(X), \rho_{i_2}(X) \mid \rho_{i_1}(Y), \rho_{i_2}(Y)).$$
With Lemma 5, we can rewrite this last expression as
$$C(i) = 1 - H_L(\rho_{i_1}(X) \mid \rho_{i_1}(Y), \rho_{i_2}(Y)) - H_L(\rho_{i_2}(X) \mid \rho_{i_1}(Y), \rho_{i_2}(Y)) + I_L(\rho_{i_1}(X); \rho_{i_2}(X) \mid \rho_{i_1}(Y), \rho_{i_2}(Y)).$$
Applying Lemma 3 twice, we obtain
$$H_L(\rho_{i_1}(X) \mid \rho_{i_1}(Y), \rho_{i_2}(Y)) = H_L(\rho_{i_1}(X) \mid \rho_{i_1}(Y)) - I_L(\rho_{i_1}(X); \rho_{i_2}(Y) \mid \rho_{i_1}(Y)),$$
$$H_L(\rho_{i_2}(X) \mid \rho_{i_1}(Y), \rho_{i_2}(Y)) = H_L(\rho_{i_2}(X) \mid \rho_{i_2}(Y)) - I_L(\rho_{i_2}(X); \rho_{i_1}(Y) \mid \rho_{i_2}(Y)).$$
Using Lemma 1 to change base for the two conditional entropies gives the final equation, which concludes the proof. ∎

C REPRODUCIBILITY

C.1 PERMUTATION INVARIANT FEATURE PROBING

Complementary to our permutation invariant probing algorithm, we use a specific algorithm for the generation of the evaluation dataset. It consists in dividing the evaluation dataset into groups of inputs with similar factor values. First, this helps the initialization of our EM-like approach, as the object-to-slot assignment exhibits less variation for similar images. Second, this allows visually verifying that the learned permutation is consistent across inputs (Figures 7, 8 and 9). Finally, it breaks the symmetry between objects, allowing for more meaningful visualizations of the projected latent space. To obtain a global metric, we average the values across the different groups. The method for generating the evaluation dataset is summarized in Algorithm 2. Exact hyperparameters are given below, and an ablation study can be found in Section D.

Algorithm 2 Generation of the evaluation dataset
for g = 1 to n_groups do
    # Generate initial factor values for the group
    for f in factors do
        Sample initial value initial[f] for factor f
    for i = 1 to n_samples do
        # Sample factor values locally
        for f in factors do
            Sample value image_i[f] near initial[f]
        Generate image i from factor values image_i
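A sketch of Algorithm 2 in Python (our illustration; the sampling ranges, the jitter scale and the `render` function are hypothetical placeholders for the dataset's actual generator):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_groups(factor_ranges, render, n_groups=10, n_samples=5000,
                    jitter=0.1):
    """Sample groups of images whose factor values stay close to a
    group-specific anchor, so object-to-slot assignments vary little
    within a group. `render` maps a factor dict to an image (assumed)."""
    groups = []
    for _ in range(n_groups):
        # Anchor factor values for this group.
        initial = {f: rng.uniform(lo, hi)
                   for f, (lo, hi) in factor_ranges.items()}
        samples = []
        for _ in range(n_samples):
            # Local perturbation around the anchor, clipped to the domain.
            values = {f: np.clip(v + jitter * rng.normal(), *factor_ranges[f])
                      for f, v in initial.items()}
            samples.append((render(values), values))
        groups.append(samples)
    return groups
```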
GENESIS We follow the training process described in (Engelcke et al., 2019) and use the implementation provided by the authors, released under the GNU General Public License. CLEVR6 is not considered in that paper: we used the same parameters as for Multi-dSprites and changed the reconstruction objective to 1.428.

IODINE We use the parameters described in (Greff et al., 2019), and a third-party implementation.¹ For Multi-dSprites, we had to increase the σ parameter from 0.1 to 0.14 in order to obtain satisfying results with the slight variations in our dataset.

We train all models for 200 epochs. This corresponds to approximately 300,000 updates on Multi-dSprites and 450,000 updates on CLEVR6. Models were trained with one to four V100 GPUs. Training times range from a few hours to 1.5 days.

C.3 TRAINING DATASETS

Multi-dSprites For training, we use the dataset and preprocessing code from Engelcke et al. (2019). Note that this dataset contains images composed of 1 to 4 objects, contrary to (Greff et al., 2019), which goes up to 5 objects. This explains the better segmentation values in our setting compared to the latter.

CLEVR6 We generate the training dataset similarly to Greff et al. (2019), with the modification that we add an additional constraint that all objects must be visible inside the cropped 192x192 image. This modification is important for our metric. It tends to generate slightly denser scenes than in previous work, which might partly explain why we obtain a slightly worse ARI for MONet than previous work (0.94 against 0.96). Scenes have at most 6 objects.

C.4 EVALUATION

Evaluation datasets The evaluation datasets are sampled in the same way, except that we restrict to images with exactly 4 objects for Multi-dSprites and exactly 6 objects for CLEVR6. We sample 10 local groups (see Algorithm 2) for Multi-dSprites and 5 for CLEVR6. Each group has 5000 samples, with a 4000/500/500 split for fitting, validation and evaluation of the factor predictor. In Table 7, we give the list of factors for each individual object in both datasets, with the possible values for each factor.

Factor prediction For the temporary predictors in Algorithm 1 (inside the loop), we use a linear model with Ridge regularization. For the final predictor, we use a random forest with 10 trees and a maximum depth of 15 (a sketch of both predictors is given below). This is because the random forest obtains better predictions, while the linear model permits faster iterations. The number of loop iterations is set to 100 for Multi-dSprites. For computational reasons, we reduced this number to 20 for CLEVR6. This reduction does not harm performance, as the vast majority of permutations happen in the first iterations. Most factors are continuous or ordinal: for these, we frame factor prediction as a regression task. The material factor in CLEVR6 has only two classes, and we follow the encoding used in the sklearn RidgeClassifier. On both datasets, the shape factor has three classes: we generalize the RidgeClassifier encoding and encode the classes as -1, 0 and 1, matching classes to values so as to minimize prediction error.

Max tree depth We tried different max tree depths (5, 10, 15, 20, 25, 30) on a validation set and observed that (i) a value of 15 generally gives better validation performance than 5 and 10, and (ii) values above 15 do not significantly improve validation performance while becoming very slow to fit. This is why we always use a depth of 15.
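For concreteness, the sketch below instantiates the two predictors described above. The estimator choices and the forest hyperparameters (10 trees, maximum depth 15) follow the text; the Ridge regularization strength, the variable names and the shape-to-value assignment are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

def make_temporary_predictor():
    # Fast linear probe, refitted at every iteration of the EM-like loop
    # (Algorithm 1); alpha is an illustrative choice.
    return Ridge(alpha=1.0)

def make_final_predictor():
    # Slower but more accurate probe, fitted once after the loop.
    return RandomForestRegressor(n_estimators=10, max_depth=15)

# Discrete factors are cast to regression targets. For the three shape
# classes we generalize the RidgeClassifier encoding to {-1, 0, 1}, with the
# class-to-value assignment chosen to minimize prediction error (the
# assignment below is hypothetical).
shape_targets = {"circle": -1.0, "square": 0.0, "heart": 1.0}

# Example: fit the final predictor on latent codes Z for one factor y and
# read off the feature importances used to build the affinity matrix.
Z = np.random.randn(4000, 64)   # placeholder latent codes (4000 fitting samples)
y = np.random.choice(list(shape_targets.values()), size=4000)
probe = make_final_predictor().fit(Z, y)
importances = probe.feature_importances_
```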
Note that this optimal depth might be related to the number of samples used to fit the predictor (4000 in our setup). Note also that the validation was done globally, and not per factor.

Metric In order to filter out noise in the feature importances, we set to 0 all relative importance coefficients that are smaller than 3% of the column maximum in absolute value. In our evaluation of object-level disentanglement, we remove the factors and latent dimensions that are background-related. For Multi-dSprites, the background slot is identified as the one with the highest importance in predicting the background color. For CLEVR6, there is no background color, but the background is always reconstructed by the first slot in the benchmarked models. The global informativeness metric is an unweighted average over all factors.

¹ github.com/zhixuan-lin/IODINE (no license provided).

Table 4: Ablation study of Algorithms 1 and 2 for Multi-dSprites. We report mean and standard error over three random seeds. All values are in %. 100 is the best possible score and 0 the worst, except for the informativeness metric, where 0 is best. All models are trained with the full loss.

MODEL                                 OBJECT-LEVEL DIS. (↑)   OBJECT-LEVEL COMP. (↑)
MONet (Att.)                          52 ± 4                  52 ± 4
MONet (Att.), without Alg. 1          30 ± 2                  30 ± 2
MONet (Att.), without Alg. 2          57 ± 1                  57 ± 1
MONet (Att.), without Alg. 1 and 2    22 ± 0.5                22 ± 0.5
MONet (Rec.)                          75 ± 7                  75 ± 7
MONet (Rec.), without Alg. 1          40 ± 3                  40 ± 4
MONet (Rec.), without Alg. 2          64 ± 4                  64 ± 4
MONet (Rec.), without Alg. 1 and 2    24 ± 1                  24 ± 1
GENESIS                               90 ± 0.8                91 ± 0.6
GENESIS, without Alg. 1               89 ± 0.1                90 ± 0.1
GENESIS, without Alg. 2               61 ± 0.4                61 ± 0.4
GENESIS, without Alg. 1 and 2         33 ± 0.3                33 ± 0.3
IODINE                                70 ± 9                  72 ± 10
IODINE, without Alg. 1                32 ± 2                  33 ± 2
IODINE, without Alg. 2                66 ± 1                  66 ± 1
IODINE, without Alg. 1 and 2          26 ± 0.9                26 ± 0.9

Table 5: Informativeness and completeness per object property for all models trained with the full loss on Multi-dSprites.

                COMPLETENESS (↑)                              INFORMATIVENESS (↓)
PROPERTY        GENESIS  IODINE  MONet (Att.)  MONet (Rec.)   GENESIS  IODINE  MONet (Att.)  MONet (Rec.)
R channel       91       71      88            94             9        16      5             5
G channel       90       68      81            89             10       17      6             7
B channel       90       77      86            96             9        16      4             5
Shape           30       49      42            50             26       45      59            34
Scale           32       49      42            55             38       53      66            41
Orientation     24       48      36            45             31       65      97            62
X               34       59      39            69             36       50      75            36
Y               33       63      39            67             35       48      74            34

Table 6: Informativeness and completeness per object property for all models trained with the full loss on CLEVR6.

                COMPLETENESS (↑)                       INFORMATIVENESS (↓)
PROPERTY        GENESIS  MONet (Att.)  MONet (Rec.)    GENESIS  MONet (Att.)  MONet (Rec.)
R channel       54       47            47              47       57            61
G channel       59       50            46              41       58            62
B channel       61       54            46              38       47            58
Shape           27       40            39              64       66            66
Material        56       56            53              28       39            41
Size            49       64            63              24       28            22
Rotation        NA       36            36              106      107           107
X               27       52            54              38       47            41
Y               31       61            65              30       39            32

Figure 7: Visualization of the effect of Algorithm 1 on slot ordering for a group of similar inputs. (a) Before Algorithm 1; (b) after Algorithm 1.

Figure 8: Visualization of the effect of Algorithm 1 on slot ordering for a group of similar inputs. (a) Before Algorithm 1; (b) after Algorithm 1.

Figure 9: Visualization of the effect of Algorithm 1 on slot ordering for a group of similar inputs. (a) Before Algorithm 1; (b) after Algorithm 1.
Figure 10: Projections of the affinity matrix at object-level and property-level for all four models trained with the full loss on Multi-dSprites (Hinton diagram). [Panel rows index objects 1-5; panel columns index the properties R, G, B, shape, scale, orientation, X, Y.]

Figure 11: Projections of the affinity matrix at object-level and property-level for all three models trained with the full loss on CLEVR6 (Hinton diagram). [Panel rows index objects 1-6; panel columns index the properties R, G, B, shape, material, size, rotation, X, Y.]

Table 7: Description of the factors of variation for each individual object in both datasets. For each factor, we give the set of possible values, as well as the exact locality constraint used in Algorithm 2. x denotes the initial value sampled in Algorithm 2; when x takes values in an array a, we denote by i the index such that x = a[i]. The table also indicates whether each factor is intrinsic or extrinsic. Whether object size should be considered intrinsic can be debated; however, this does not change the results of our experiments.

DATASET          FACTOR OF VARIATION   TYPE       POSSIBLE VALUES               LOCAL CONSTRAINT (ALG. 2)
Multi-dSprites   Color (R channel)     Intrinsic  [0, 63, 127, 191, 255]        [a[i-1], a[i], a[i+1]]
                 Color (G channel)     Intrinsic  [0, 63, 127, 191, 255]        [a[i-1], a[i], a[i+1]]
                 Color (B channel)     Intrinsic  [0, 63, 127, 191, 255]        [a[i-1], a[i], a[i+1]]
                 Shape                 Intrinsic  {circle, square, heart}       {circle, square, heart}
                 Scale                 Intrinsic  [0, 1, 2, 3, 4, 5]            [a[i-1], a[i], a[i+1]]
                 Orientation           Extrinsic  [0, 1, ..., 38, 39]           [a[i-3], a[i-2], ..., a[i+2], a[i+3]]
                 X coord.              Extrinsic  [0, 1, ..., 30, 31]           [a[i-2], ..., a[i+2]]
                 Y coord.              Extrinsic  [0, 1, ..., 30, 31]           [a[i-2], ..., a[i+2]]
CLEVR6           Color (R channel)     Intrinsic  [0, 0.2, 0.4, 0.6, 0.8, 1]    [a[i-1], a[i], a[i+1]]
                 Color (G channel)     Intrinsic  [0, 0.2, 0.4, 0.6, 0.8, 1]    [a[i-1], a[i], a[i+1]]
                 Color (B channel)     Intrinsic  [0, 0.2, 0.4, 0.6, 0.8, 1]    [a[i-1], a[i], a[i+1]]
                 Shape                 Intrinsic  {cube, sphere, cylinder}      {cube, sphere, cylinder}
                 Material              Intrinsic  {rubber, metal}               {rubber, metal}
                 Size                  Intrinsic  [0.35, 0.7]                   [0.35, 0.7]
                 Rotation              Extrinsic  ⟦0, 360⟧                      ⟦0, 360⟧
                 X coord.              Extrinsic  ⟦-3, 3⟧                       ⟦x - 0.7, x + 0.7⟧ ∩ ⟦-3, 3⟧
                 Y coord.              Extrinsic  ⟦-3, 3⟧                       ⟦x - 0.7, x + 0.7⟧ ∩ ⟦-3, 3⟧

D ADDITIONAL EXPERIMENTAL RESULTS

D.1 VISUALIZATION OF THE PERMUTATION INFERRED BY OUR FEATURE IMPORTANCE ALGORITHM

In this section, we inspect the slot ordering inferred by Algorithm 1 by visualizing the masked reconstruction of each slot; note that Algorithm 1 does not make use of these reconstructions. In Figures 7, 8 and 9, we compare the slot ordering before (left) and after (right) applying the algorithm, for two groups of similar images (generated following Algorithm 2). We include all the considered architectures, trained on Multi-dSprites with the variational loss. On the left, we observe that the slot ordering inferred by the models is not consistent across similar images. This effect is particularly visible for IODINE, despite the fact that we set the model's internal noise to a deterministic value (following observations in Greff et al. (2019)); the background slot is not even deterministic. This unpredictable ordering is extremely detrimental to the accuracy of our metric.
Applying our metric to IODINE with this initial ordering yielded extremely poor values, which do not do justice to the good representation learned by this model. On the right, we see that Algorithm 1 successfully learns a consistent slot ordering and puts matching objects at the same position. We notice almost perfect alignment, except for some failure cases with IODINE, where the background slot is switched with a missing object. We are uncertain about the reasons for this failure, but it has limited impact on the final performance, as we remove the background factors in the object-level projection of the affinity matrix.

D.2 VISUALIZATIONS OF THE LATENT SPACE

In Figures 10 and 11, we show visualizations of the projected latent space for the different architectures (trained with the variational loss) for a group of inputs in both evaluation datasets. The qualitative inspection of the object-level projection is consistent with the numerical comparison: GENESIS is visually more disentangled, while MONet and IODINE achieve inferior but still satisfying disentanglement. On Multi-dSprites, the under-performance of the MONet (Att.) variant is also visible.

On Multi-dSprites, we observe that the GENESIS background slot also contains information about all objects. This is not surprising, as the background mask is in some sense a negative of the different object masks. However, this phenomenon is less marked for the other architectures; we suspect that the specific mask encoding of GENESIS strengthens this duplication of information between background and object slots. Note that this does not harm performance, as we remove the background slot in the final metric. We believe that this post-processing is fair because of the expected duplication effect described above.

Qualitative comparison of the property-level projections is harder due to the higher number of dimensions. Still, some observations can be made. Among object properties, color consistently obtains the best separation in the representation. For GENESIS, we also notice a clear separation between the last 16 dimensions of each slot, which correspond to the component latent, and the rest of the slot: color is almost exclusively encoded by the component latent.

D.3 DECOMPOSITION OF THE INFORMATIVENESS AND COMPLETENESS PER FACTOR

In Tables 5 and 6, we decompose the informativeness and completeness of the different models per property. Note that the per-property completeness results do not necessarily average to the global object-level completeness, as our metric involves a weighted average (see the sketch at the end of this subsection). On Multi-dSprites, we observe that the superior results of GENESIS mostly come from the group of extrinsic factors, which are related to the segmentation mask. This supports the hypothesis that the innovative mask encoding of GENESIS is at least partly responsible for the performance increase. However, this effect is less clear on CLEVR6.

On CLEVR6, we notice particularly bad metrics for the rotation property. This is not surprising, as the rotation parameter is completely useless for two of the three shapes (sphere and cylinder) and redundant for the last one (cube). Because of this poor identifiability and the thresholding of affinity scores, we cannot compute the rotation completeness for GENESIS. This does not impact the final metric, due to the weighting scheme of our metric.

Concerning color channels, we note a very different behavior between Multi-dSprites and CLEVR6. We believe this is due to a particularity of CLEVR6: the set of colors is restricted in the training dataset.
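To make this aggregation concrete, the sketch below computes per-factor completeness scores and their weighted aggregate from an importance matrix, under the assumption that our metric follows the DCI-style weighting of Eastwood and Williams (2018); the normalization details and names are illustrative.

```python
import numpy as np

def completeness_per_factor(R):
    """Per-factor completeness from an importance matrix R[latent, factor].

    Sketch assuming the Eastwood and Williams (2018) definition: the
    completeness of factor j is 1 - H(P[:, j]), with the entropy taken in
    base n_latents. R would come from the probing predictor's importances.
    """
    n_latents = R.shape[0]
    P = R / (R.sum(axis=0, keepdims=True) + 1e-12)  # importance distribution per factor
    H = -np.sum(np.where(P > 0.0, P * np.log(P + 1e-12), 0.0), axis=0)
    return 1.0 - H / np.log(n_latents)

def global_completeness(R):
    # Factors are weighted by the share of the total importance they receive,
    # so the global score is generally not the plain mean of the per-factor
    # values reported in Tables 5 and 6.
    weights = R.sum(axis=0) / R.sum()
    return float(np.sum(weights * completeness_per_factor(R)))
```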
D.4 ABLATION OF ALGORITHMS 1 AND 2

In Table 4, we compare the object-level values given by our metric with and without Algorithms 1 and 2. We observe a significant performance drop when ablating these algorithms, which is consistent with the visual inspection of the slot ordering.