
Quantifying and Learning Linear Symmetry-Based Disentanglement

Loek Tonnaer *1,2, Luis A. Pérez Rey *1,2,3, Vlado Menkovski 1,2, Mike Holenderski 1,2, Jacobus W. Portegies 1,2

The definition of Linear Symmetry-Based Disentanglement (LSBD) formalizes the notion of linearly disentangled representations, but there is currently no metric to quantify LSBD. Such a metric is crucial to evaluate LSBD methods and to compare to previous understandings of disentanglement. We propose $\mathcal{D}_{\mathrm{LSBD}}$, a mathematically sound metric to quantify LSBD, and provide a practical implementation for SO(2) groups. Furthermore, from this metric we derive LSBD-VAE, a semi-supervised method to learn LSBD representations. We demonstrate$^1$ the utility of our metric by showing that (1) common VAE-based disentanglement methods don't learn LSBD representations, (2) LSBD-VAE, as well as other recent methods, can learn LSBD representations needing only limited supervision on transformations, and (3) various desirable properties expressed by existing disentanglement metrics are also achieved by LSBD representations.

1. Introduction

Learning low-dimensional representations that disentangle the underlying factors of variation in data is considered an important step towards interpretable machine learning with good generalization. To address the fact that there is no consensus on what disentanglement entails and how to formalize it, Higgins et al. (2018) propose a formal definition for Linear Symmetry-Based Disentanglement, or LSBD, arguing that underlying real-world symmetries give exploitable structure to data (see Sect. 3).

*Equal contribution. $^1$Eindhoven University of Technology (TU/e), Eindhoven, The Netherlands. $^2$Eindhoven Artificial Intelligence Systems Institute (EAISI), Eindhoven, The Netherlands. $^3$Prosus, Amsterdam, The Netherlands. Correspondence to: Loek Tonnaer <l.m.a.tonnaer@tue.nl>, Luis A. Pérez Rey <l.a.perez.rey@tue.nl>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

$^1$Code available at https://github.com/luis-armando-perez-rey/lsbd-vae

LSBD emphasizes that the variability in data observations is often due to some transformations, and that good data representations should reflect these transformations. A typical setting is that of an agent interacting with its environment. An action of the agent will transform some aspect of the environment and its observation thereof, but keeps all other aspects invariant. It is often easy and cheap to register the actions that an agent performs and how they transform the observed environment, which can provide useful information for learning disentangled representations.

However, there is currently no general metric to quantify LSBD. Such a metric is crucial to properly evaluate methods aiming to learn LSBD representations and to relate LSBD to previous definitions of disentanglement. Although previous works have evaluated LSBD by measuring performance on downstream tasks (Caselles-Dupré et al., 2019) or by measuring specific traits related to LSBD (Painter et al., 2020; Quessard et al., 2020), none of these evaluation methods directly quantify LSBD according to its formal definition.

We propose $\mathcal{D}_{\mathrm{LSBD}}$, a well-formalized and generally applicable metric that quantifies the level of LSBD in learned data representations (Sect. 4). We show an intuitive justification of this metric, as well as its theoretical derivation. We also provide a practical implementation to compute $\mathcal{D}_{\mathrm{LSBD}}$ for common SO(2) symmetry groups. Furthermore, we show that our metric formulation can be used to derive a semi-supervised method to learn LSBD representations, which we call LSBD-VAE (Sect. 5). To make LSBD-VAE more widely applicable, we also demonstrate how to disentangle symmetric properties from other non-symmetric properties, and how to quantify this disentanglement with $\mathcal{D}_{\mathrm{LSBD}}$.

We show the utility of $\mathcal{D}_{\mathrm{LSBD}}$ by quantifying LSBD in a number of settings, for a variety of datasets with underlying SO(2) symmetries and other non-symmetric properties (Sect. 6 & 7). First, we evaluate common VAE-based disentanglement methods and show that most don't learn LSBD representations. Second, we evaluate LSBD-VAE and other recent methods that specifically target LSBD, showing that they can obtain much better $\mathcal{D}_{\mathrm{LSBD}}$ scores while needing only limited supervision on transformations. Third, we compare $\mathcal{D}_{\mathrm{LSBD}}$ with existing disentanglement metrics, showing that various desirable properties expressed with these metrics are also achieved by LSBD representations.


2. Related Work

Many recent works have focused on learning and quantifying disentangled representations, but research has shown that there is little consensus about the exact definition of disentanglement, and that methods often do not achieve it as well as they claim (Locatello et al., 2019). To introduce some much-needed formalization, Higgins et al. (2018) proposed to define disentanglement with respect to symmetry transformations acting on the data. They used group theory to provide two formal definitions, which we refer to as (Linear) Symmetry-Based Disentanglement, or (L)SBD. In this paper we focus only on LSBD, not SBD.

Several methods have been proposed to learn LSBD representations (Caselles-Dupré et al., 2019; Painter et al., 2020; Quessard et al., 2020). These methods also learn to represent the transformations acting on the input data, assuming various levels of supervision on these transformations. Other methods have previously focused on capturing transformations of the data outside the context of disentanglement as well (Cohen & Welling, 2015; Sosnovik et al., 2019; Worrall et al., 2017).

Although some of these works do propose metrics that measure some aspect of LSBD, none of them provide a general metric that directly quantifies LSBD according to its formal definition and for any data representation. Painter et al. (2020) mention two metrics. Independence Score measures whether the actions of the subgroups have effects on independent vector spaces. Factor Leakage only measures the number of dimensions in which the subgroup actions are encoded, which is not a property required by LSBD. Neither is a general quantification of LSBD. Additionally, Quessard et al. (2020) also propose a metric, but this is in fact a loss component particular to their group representation parameterization and cannot be used as a general metric for LSBD.

3. Linear Symmetry-Based Disentanglement

Higgins et al. (2018) provide a formal definition of linear disentanglement that connects symmetry transformations affecting the real world (from which data is observed) to the internal representations of a model. The definition is grounded in concepts from group theory; we provide a more detailed description of these concepts in Appendix A.

The definition$^2$ considers a group $G$ of symmetry transformations acting on the data space $X$ through the group action $\cdot : G \times X \to X$. In particular, $G$ can be decomposed as the direct product of $K$ groups $G = G_1 \times \ldots \times G_K$. A model's internal representation of data is modeled with the encoding function $h : X \to Z$ that maps data to the embedding space $Z$. The definition for Linearly Symmetry-Based Disentangled (LSBD) representations then formalizes the requirement that a model's encoding $h$ should reflect and disentangle the transformation properties of the data, and that the transformation properties of the model's encoding should be linear. The exact definition is as follows:

Definition: Linear Symmetry-Based Disentanglement (LSBD). A model's encoding map $h : X \to Z$, where $Z$ is a vector space, is LSBD with respect to the group decomposition $G = G_1 \times \ldots \times G_K$ if

1. there is a decomposition of the embedding space $Z = Z_1 \oplus \ldots \oplus Z_K$ into $K$ vector subspaces,

2. there are group representations for each subgroup in the corresponding vector subspace, $\rho_k : G_k \to GL(Z_k)$, $k \in \{1, \ldots, K\}$,

3. the group representation $\rho : G \to GL(Z)$ acts on $Z$ as

$$\rho(g) \cdot z = (\rho_1(g_1) \cdot z_1, \ldots, \rho_K(g_K) \cdot z_K), \tag{1}$$

for $g = (g_1, \ldots, g_K) \in G$ and $z = (z_1, \ldots, z_K) \in Z$ with $g_k \in G_k$ and $z_k \in Z_k$,

4. the map $h$ is equivariant with respect to the actions of $G$ on $X$ and $Z$, i.e., for all $x \in X$ and $g \in G$ it holds that $h(g \cdot x) = \rho(g) \cdot h(x)$.

Furthermore, we say that a group representation $\rho$ is linearly disentangled with respect to the group decomposition $G = G_1 \times \ldots \times G_K$ if it satisfies criteria 1 to 3 from the LSBD definition above.

$^2$The original definition actually considers an additional set of world states $W$, but our definition is more practical and can be shown to be the same under mild conditions, see Appendix B.
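As an illustration of the definition, the following is a minimal NumPy sketch for $G = SO(2) \times SO(2)$ with $Z = \mathbb{R}^2 \oplus \mathbb{R}^2$. The toy data space (pairs of angles), the action `act`, and the encoder `h` are hypothetical choices made only so that criterion 4 can be checked numerically; they are not taken from the paper.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, a representation of SO(2) on R^2."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rho(g):
    """Block-diagonal representation of G = SO(2) x SO(2) on R^2 (+) R^2.

    g is a pair of angles; each subgroup acts only on its own 2-dimensional
    subspace, which is what criteria 1 to 3 require.
    """
    out = np.zeros((4, 4))
    out[:2, :2] = rot(g[0])
    out[2:, 2:] = rot(g[1])
    return out

def act(g, x):
    """Toy group action on the data space (assumption): add angles modulo 2*pi."""
    return (np.asarray(g) + np.asarray(x)) % (2 * np.pi)

def h(x):
    """Toy encoder (assumption): embed each angle on its own unit circle."""
    return np.array([np.cos(x[0]), np.sin(x[0]), np.cos(x[1]), np.sin(x[1])])

g, x = (0.7, 2.1), (1.3, 5.0)
lhs = h(act(g, x))       # h(g . x)
rhs = rho(g) @ h(x)      # rho(g) . h(x)
equivariant = np.allclose(lhs, rhs)   # criterion 4 for this g and x
```

Because `rho(g)` is block diagonal, changing the first angle of `g` never moves the last two coordinates of the encoding, and vice versa.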

4. Quantifying LSBD: $\mathcal{D}_{\mathrm{LSBD}}$

4.1. Intuition: Measuring Equivariance with Dispersion

To motivate our metric, let us first assume a setting in which a suitable linearly disentangled group representation $\rho$ is known. Let us further assume that the dataset of observations can be expressed with respect to $G$ acting on some base point $x_0 \in X$, i.e. $\{x_n\}_{n=1}^N = \{g_n \cdot x_0\}_{n=1}^N$. Formally, this assumes that the action of $G$ on $X$ is regular. In this case, we can use the inverse group elements $g_n^{-1}$ to transform each data point toward the base point $x_0$, i.e.

$$x_0 = g_1^{-1} \cdot x_1 = \ldots = g_N^{-1} \cdot x_N. \tag{2}$$

Since $\rho$ is linearly disentangled, we only need to measure the equivariance of the encoding map $h$ to quantify LSBD. Equivariance is achieved when $h(g \cdot x) = \rho(g) \cdot h(x)$, for all $g \in G$, $x \in X$. Given the dataset described above, we can check this property for $x \in \{x_n\}_{n=1}^N$ and $g \in \{g_n\}_{n=1}^N$.$^3$

Figure 1: A dataset of images from a rotating object expressed in terms of the group $G = SO(2)$ acting on a base image $x_0$. It is possible to quantify the level of LSBD of an encoding map $h$ by measuring its equivariance with respect to a group representation $\rho$. Since all data has been generated from $x_0$, equivariance can be measured as the dispersion of the points $\{\rho(g_n^{-1}) \cdot h(x_n)\}_{n=1}^N$.

In particular, from Equation (2) we can see that we have equivariance if

$$h(x_0) = \rho(g_1^{-1}) \cdot h(x_1) = \ldots = \rho(g_N^{-1}) \cdot h(x_N). \tag{3}$$

This not only characterizes perfect equivariance, but also allows for an efficient way to quantify how close we are to true equivariance, by measuring the dispersion of the points $\{\rho(g_n^{-1}) \cdot h(x_n)\}_{n=1}^N$.$^4$ Given a suitable norm $\|\cdot\|_Z$ on $Z$, we can thus quantify LSBD in this setting as

$$\frac{1}{N} \sum_{n=1}^N \left\| \rho(g_n^{-1}) \cdot h(x_n) - M \right\|_Z^2, \quad \text{with } M = \frac{1}{N} \sum_{n'=1}^N \rho(g_{n'}^{-1}) \cdot h(x_{n'}), \tag{4}$$

i.e. we compute the mean $M$ of $\{\rho(g_n^{-1}) \cdot h(x_n)\}_{n=1}^N$ and use the average squared distance to this mean for points in $\{\rho(g_n^{-1}) \cdot h(x_n)\}_{n=1}^N$ as our LSBD metric, see Fig. 1.
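The dispersion of Equation (4) can be sketched in a few lines of NumPy for the single-group case $G = SO(2)$. The sketch assumes that $\rho(g)$ is the rotation matrix for the angle of $g$ and that the norm is Euclidean; the function name `dispersion` and the synthetic encodings are illustrative only.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix for angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def dispersion(angles, encodings):
    """Average squared distance to the mean, as in Equation (4).

    angles:    shape (N,), angle of each g_n with x_n = g_n . x_0
    encodings: shape (N, 2), rows are h(x_n)
    """
    # rho(g_n^{-1}) . h(x_n): rotate every encoding back by its own angle.
    aligned = np.stack([rot(-a) @ z for a, z in zip(angles, encodings)])
    mean = aligned.mean(axis=0)  # M in Equation (4)
    return np.mean(np.sum((aligned - mean) ** 2, axis=1))

angles = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
# Encodings of a perfectly equivariant encoder (up to a fixed offset of 0.4 rad).
perfect = np.stack([np.cos(angles + 0.4), np.sin(angles + 0.4)], axis=1)
rng = np.random.default_rng(0)
noisy = perfect + 0.1 * rng.normal(size=perfect.shape)

d_perfect = dispersion(angles, perfect)  # numerically zero
d_noisy = dispersion(angles, noisy)      # strictly positive
```

Note that neither $x_0$ nor $h(x_0)$ appears in the computation; only the known angles and the encodings are needed.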

However, this formulation requires knowing the right linearly disentangled group representation and a suitable norm on $Z$. Moreover, it implicitly assumes a uniform probability measure over the group elements $\{g_n\}_{n=1}^N$. In the next section we formulate our metric for a more general setting.

$^3$Note that $\{g_n\}_{n=1}^N$ can be used to describe all known group transformations between elements in the dataset by means of composition and inverses, since $x_i = g_i \cdot (g_j^{-1} \cdot x_j)$. Thus it suffices to check equivariance for these $N$ group transformations.

$^4$Note that we do not actually need to know $x_0$ nor $h(x_0)$.

4.2. $\mathcal{D}_{\mathrm{LSBD}}$: A Metric for LSBD

Generalizing the ideas from the previous section with concepts from measure theory, we propose a metric to measure the level of LSBD of any encoding $h : X \to Z$ given a data probability measure $\mu$ on $X$, provided that $\mu$ can be written as the pushforward $G_X(\cdot, x_0)_\# \nu$ of some probability measure $\nu$ on $G$ by the function $G_X(\cdot, x_0)$ for some base point $x_0$. More formally,

$$\mu(A) = G_X(\cdot, x_0)_\# \nu(A) = \nu\left(\{g \in G \mid G_X(g, x_0) \in A\}\right), \tag{5}$$

for Borel subsets $A \subseteq X$. Note that this is only possible if the action $G_X$ is transitive.

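For example, the situation of a dataset with $N$ datapoints $\{x_n\}_{n=1}^N = \{g_n \cdot x_0\}_{n=1}^N$ corresponds to the case in which $\nu$ and $\mu$ are empirical measures on the group $G$ and data space $X$, respectively:

$$\nu := \frac{1}{N} \sum_{i=1}^N \delta_{g_i}, \qquad \mu := \frac{1}{N} \sum_{i=1}^N \delta_{x_i}. \tag{6}$$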

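We define the metric $\mathcal{D}_{\mathrm{LSBD}}$ for an encoding $h$ and a measure $\mu$ as

$$\mathcal{D}_{\mathrm{LSBD}} := \inf_{\rho \in \mathcal{P}(G,Z)} \int_G \left\| \rho(g)^{-1} \cdot h(g \cdot x_0) - M_{\rho,h,x_0} \right\|_{\rho,h,\mu}^2 \, d\nu(g), \quad \text{with } M_{\rho,h,x_0} = \int_G \rho(g')^{-1} \cdot h(g' \cdot x_0) \, d\nu(g'), \tag{7}$$

where the norm $\|\cdot\|_{\rho,h,\mu}$ is a Hilbert-space norm depending on the representation $\rho$, the encoding map $h : X \to Z$, and the data measure $\mu$. More details of this norm can be found in Appendix C. Moreover, $\mathcal{P}(G, Z)$ denotes the set of linearly disentangled representations of $G$ in $Z$. Lower values of $\mathcal{D}_{\mathrm{LSBD}}$ indicate better disentanglement, zero being optimal.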
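As a worked instance, consider the empirical measure $\nu$ of Equation (6). Integrals against $\nu$ become averages over the known group elements, so Equation (7) takes a finite form. The following sketch only substitutes Equation (6) into Equation (7) and uses $x_n = g_n \cdot x_0$:

```latex
\int_G f(g)\, d\nu(g) = \frac{1}{N} \sum_{n=1}^{N} f(g_n)
\quad\Longrightarrow\quad
\mathcal{D}_{\mathrm{LSBD}}
  = \inf_{\rho \in \mathcal{P}(G,Z)} \frac{1}{N} \sum_{n=1}^{N}
    \left\| \rho(g_n)^{-1} \cdot h(x_n) - M_{\rho,h,x_0} \right\|_{\rho,h,\mu}^{2},
\qquad
M_{\rho,h,x_0} = \frac{1}{N} \sum_{n'=1}^{N} \rho(g_{n'})^{-1} \cdot h(x_{n'}).
```

Since $\rho(g)^{-1} = \rho(g^{-1})$ for any group representation, this is the dispersion of Equation (4) with the fixed representation replaced by an infimum over linearly disentangled representations, and the norm $\|\cdot\|_Z$ replaced by $\|\cdot\|_{\rho,h,\mu}$.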

4.3. Practical Computation of $\mathcal{D}_{\mathrm{LSBD}}$

There are two main challenges for computing the metric of Equation (7). First, to calculate the integrals in the formula, all possible datapoints that can be expressed as $g \cdot x_0$ with $g \in G = G_1 \times \ldots \times G_K$ must be available. Second, the infimum of the integrals over all possible linearly disentangled representations must be estimated. This requires finding the possible invariant subspaces $Z = Z_1 \oplus \ldots \oplus Z_K$ induced by the encoding $h$ over which the group representations are disentangled.

We present a practical implementation of an upper bound to $\mathcal{D}_{\mathrm{LSBD}}$ for an encoding function $h$ given a dataset $\mathcal{X}$ generated by some known group transformations. This approximation of $\mathcal{D}_{\mathrm{LSBD}}$ is designed for a group decomposition $G = G_1 \times \ldots \times G_K$ where each $G_k = SO(D_k)$, $k \in \{1, \ldots, K\}$, is the group of rotations in $D_k$ dimensions. This implementation approximates the integrals of Equation (7) by using the empirical distribution of $\mathcal{X}$. The invariant subspaces of $Z$ to the subgroup actions are found by applying a suitable change of basis. In the new basis, the disentangled group representations are expressed in a parametric form whose parameters are optimized to find the tightest bound to $\mathcal{D}_{\mathrm{LSBD}}$. See Fig. 2 for an intuitive description of the process.

Figure 2: Consider a dataset modeled by a group decomposition $G = G_1 \times \ldots \times G_K$ acting on $x_0$ and embedded in a latent space $Z$ via $h$. In this example the subgroup $G_k = SO(2)$ models the rotations of an airplane. Other subgroups $G_j$, $j \neq k$, could also be acting, e.g. changes in airplane color. The first step to calculate the disentanglement of $G_k$ is to construct a set of data embeddings $\mathcal{Z}_k \subseteq Z$ whose variability is due to $G_k$. These embeddings are then projected into a 2-dimensional space through PCA. For these projected embeddings we can describe the group representations in a simple parametric form $\rho_{k,\omega}$. For a given $\rho_{k,\omega}$ the equivariance of $G_k$ is measured as the dispersion after applying the action of the inverse group representation $\rho_{k,\omega}^{-1}$.

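Assume there is a dataset $\mathcal{X}$ that can be modeled in terms of the group decomposition $G = G_1 \times \ldots \times G_K$. For each subgroup $G_k$ there is a set of known group elements $\mathcal{G}_k \subseteq G_k$, uniformly sampled, such that the dataset is described in terms of all elements in $\mathcal{G} = \mathcal{G}_1 \times \ldots \times \mathcal{G}_K$ and a base point $x_0$ as $\mathcal{X} = \{(g_1, \ldots, g_K) \cdot x_0 \mid g_k \in \mathcal{G}_k, k \in \{1, \ldots, K\}\}$.

For each subgroup $G_k$ we construct a set of encoded data $\mathcal{Z}_k \subseteq Z$ whose variability should only depend on the action of $G_k$. The set $\mathcal{Z}_k$ is given by $\mathcal{Z}_k = \{z_k(g_1, \ldots, g_K) \mid g_j \in \mathcal{G}_j, j \in \{1, \ldots, K\}\}$, in which

$$z_k(g_1, \ldots, g_K) = h((g_1, \ldots, g_K) \cdot x_0) - \frac{1}{|\mathcal{G}_k|} \sum_{g' \in \mathcal{G}_k} h((g_1, \ldots, g_{k-1}, g', g_{k+1}, \ldots, g_K) \cdot x_0). \tag{8}$$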

Similar to Cohen & Welling (2014), we find a suitable change of basis that exposes the invariant subspace $Z_k$ corresponding to the $k$-th subgroup $G_k$. The new basis is obtained from the eigenvectors resulting from applying Principal Component Analysis (PCA) to $\mathcal{Z}_k$. Each element in $\mathcal{Z}_k$ is projected onto the first $D_k$ eigenvectors. The new set is denoted as $\mathcal{Z}'_k \subseteq \mathbb{R}^{D_k}$ with elements $z'_k(g_1, \ldots, g_K) \in \mathbb{R}^{D_k}$ that are the projected versions of $z_k(g_1, \ldots, g_K)$.

Quessard et al. (2020) describe how one could parameterize the subgroup representations of $SO(D_k)$ for arbitrary $D_k$, but here we will focus on $G_k = SO(2)$. In this case, we can parameterize each subgroup representation in terms of a single integer parameter $\omega \in \mathbb{Z}$ as $\rho_{k,\omega}(g_k)$, corresponding to a $2 \times 2$ rotation matrix whose angle of rotation is $\omega$ multiplied by the known angle associated to the group element $g_k \in G_k = SO(2)$. For this subgroup we can approximate the $M_{\rho,h,x_0}$ in Equation (7) as $M_{k,\omega}$ given by

$$M_{k,\omega} = \frac{1}{|\mathcal{G}|} \sum_{(g_1, \ldots, g_K) \in \mathcal{G}} \rho_{k,\omega}(g_k^{-1}) \cdot z'_k(g_1, \ldots, g_K). \tag{9}$$

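Similar to Equation (7), we would like to find the optimal $\rho_{k,\omega}$ that minimizes the integral over the group representations. We can define a parameter search space $\Omega \subseteq \mathbb{Z}$, e.g. $\Omega = [-10, 10]$, for finding the optimal $\omega \in \Omega$ that minimizes the dispersion. This is expressed in the following equation:

$$\mathcal{D}^{(k)}_{\mathrm{LSBD}} = \min_{\omega \in \Omega} \frac{1}{|\mathcal{G}|} \sum_{(g_1, \ldots, g_K) \in \mathcal{G}} \left\| \rho_{k,\omega}(g_k^{-1}) \cdot z'_k(g_1, \ldots, g_K) - M_{k,\omega} \right\|^2. \tag{10}$$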

Each $\mathcal{D}^{(k)}_{\mathrm{LSBD}}$ measures the degree of equivariance of the projected embeddings for the $k$-th subgroup, corresponding to the best-fitting group representation. The upper bound to the metric is finally obtained by averaging across all subgroups: $\mathcal{D}_{\mathrm{LSBD}} \leq \frac{1}{K} \sum_{k=1}^K \mathcal{D}^{(k)}_{\mathrm{LSBD}}$.
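The procedure of this section can be sketched in NumPy as follows. This is a minimal illustration, not the authors' released implementation: it assumes every subgroup is SO(2) (so $D_k = 2$), that encodings are available for the full grid of known transformations, and that the plain Euclidean norm is used. The function name `d_lsbd_so2` and the synthetic encodings are hypothetical.

```python
import numpy as np

def d_lsbd_so2(encodings, angles, omegas=range(-10, 11)):
    """Upper bound on D_LSBD for G = SO(2)^K on a full grid of transformations.

    encodings: shape (n_1, ..., n_K, dim), h((g_1, ..., g_K) . x_0) on the grid
    angles:    list of K arrays; angles[k][i] is the angle of the i-th element
               of the k-th set of known group elements
    """
    K = len(angles)
    scores = []
    for k in range(K):
        # Subtract the mean over the k-th factor, so that the remaining
        # variability should depend only on the k-th subgroup.
        z_k = encodings - encodings.mean(axis=k, keepdims=True)
        flat = z_k.reshape(-1, z_k.shape[-1])
        centered = flat - flat.mean(axis=0)
        # PCA via SVD; keep the first two principal directions.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[:2].T                      # shape (|G|, 2)
        # Angle of g_k for every grid point, in the same order as `flat`.
        shape = [1] * K
        shape[k] = len(angles[k])
        theta = np.broadcast_to(np.reshape(angles[k], shape),
                                encodings.shape[:-1]).ravel()
        best = np.inf
        for omega in omegas:
            # rho_{k,omega}(g_k^{-1}): rotate each point by -omega * theta.
            c, s = np.cos(-omega * theta), np.sin(-omega * theta)
            aligned = np.stack([c * proj[:, 0] - s * proj[:, 1],
                                s * proj[:, 0] + c * proj[:, 1]], axis=1)
            mean = aligned.mean(axis=0)                 # M_{k,omega}, Equation (9)
            disp = np.mean(np.sum((aligned - mean) ** 2, axis=1))
            best = min(best, disp)
        scores.append(best)                             # best-fitting omega
    return float(np.mean(scores))                       # average over subgroups

n = 16
a = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
t1, t2 = np.meshgrid(a, a, indexing="ij")
good = np.stack([np.cos(t1), np.sin(t1), np.cos(t2), np.sin(t2)], axis=-1)
bad = np.random.default_rng(0).normal(size=good.shape)

score_good = d_lsbd_so2(good, [a, a])   # equivariant encoding: numerically zero
score_bad = d_lsbd_so2(bad, [a, a])     # random encoding: clearly larger
```

The search includes negative values of $\omega$ because PCA may return a reflected basis, in which case the best-fitting rotation direction is reversed.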


Our practical implementation of $\mathcal{D}_{\mathrm{LSBD}}$ is for SO(2) subgroups; however, the procedure can in principle be extended to other subgroups as well. A practical implementation of the metric requires (i) identifying the subspaces invariant to a subgroup and (ii) identifying a parametric representation of the subgroup that can be fitted to the subspace data representations. In cases where the exact form of the subgroup is unknown, an option is to use the method by Pfau et al. (2020) to factorize the submanifolds associated with different generative factors.

5. Learning LSBD: LSBD-VAE

In this section we present LSBD-VAE, a semi-supervised VAE-based method to learn LSBD representations. The main idea is to train an unsupervised Variational Autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) with a suitable latent space topology, and use our metric as an additional loss term for batches of transformation-labeled data.

Assumptions. LSBD-VAE requires some knowledge about the group structure $G$ that is to be disentangled. Concretely, the group and its decomposition $G = G_1 \times \ldots \times G_K$ should be known, as well as a suitable linearly disentangled group representation $\rho : G \to GL(Z)$ and a latent space $Z = Z_1 \oplus \ldots \oplus Z_K$. Moreover, we assume there exists an embedded submanifold $Z_G \subseteq Z$ such that the action of $G$ on $Z$ restricted to $Z_G$ is regular, and $Z_G$ is invariant under the action. Only $Z_G$ will then be used as the codomain for the encoding map, $h : X \to Z_G$.

We demonstrate the assumptions above for the common group structure $G = SO(2) \times SO(2)$. For the group representation $\rho = \rho_1 \oplus \rho_2$, with $Z = \mathbb{R}^2 \oplus \mathbb{R}^2$, we can use rotation matrices in $\mathbb{R}^2$ for $\rho_1$ and $\rho_2$. We can then use 1-spheres $S^1 = \{z \in \mathbb{R}^2 : \|z\| = 1\}$ for the embedded submanifold: $Z_G = S^1 \times S^1$. In this case, the action of $G$ on $Z$ restricted to $Z_G$ is indeed regular, and $Z_G$ is invariant under the action.

Requiring the group structure G to be known is a relatively strong assumption, which limits the practical applicability of our method. However, a group structure can often be given as expert knowledge, like the presence of cyclic factors such as rotation, or in situations where transformations between observed data can easily be acquired such as in reinforcement learning.

Unsupervised Learning on Latent Manifold. To learn encodings only on the latent manifold $Z_G$, we use a Diffusion Variational Autoencoder (ΔVAE) (Perez Rey et al., 2020). ΔVAEs can use any closed Riemannian manifold embedded in a Euclidean space as a latent space (or latent manifold), provided that a certain projection function from the Euclidean embedding space into the latent manifold is known and the scalar curvature of the manifold is available. The ΔVAE uses a parametric family of posterior approximates obtained from a diffusion process over the latent manifold. To estimate the intractable terms of the negative ELBO, the reparameterization trick is implemented via a random walk.

In the case of $S^1$ as a latent (sub)manifold, we consider $\mathbb{R}^2$ as the Euclidean embedding space, and the projection function$^5$ $\Pi : \mathbb{R}^2 \to S^1$ normalizes points in the embedding space: $\Pi(z) = z / \|z\|$. The scalar curvature of $S^1$ is 0.
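The projection onto the latent manifold is simple enough to write out. The sketch below normalizes points of $\mathbb{R}^2$ onto $S^1$ and applies this per subspace to obtain a point on $S^1 \times S^1$; the function names and the `eps` safeguard against $z = 0$ are assumptions added for illustration.

```python
import numpy as np

def project_to_circle(z, eps=1e-12):
    """Pi(z) = z / |z|, mapping points of R^2 (last axis) onto S^1.

    eps is a hypothetical safeguard for z = 0, where Pi is undefined.
    """
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norm, eps)

def project_to_torus(z):
    """Project points of R^2 (+) R^2 onto S^1 x S^1, one circle per subspace."""
    return np.concatenate([project_to_circle(z[..., :2]),
                           project_to_circle(z[..., 2:])], axis=-1)

z = np.array([[3.0, 4.0, 0.0, -2.0]])
p = project_to_torus(z)   # [[0.6, 0.8, 0.0, -1.0]]
```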

Semi-Supervised Learning with Transformation Labels. Caselles-Dupré et al. (2019) proved that LSBD representations cannot be inferred from a training set of unlabeled observations, but that access to the transformations between data points is needed. They therefore use a training set of observation pairs with a given transformation between them.

However, we posit that only a limited amount of supervision is sufficient. Since obtaining supervision on transformations is typically more expensive than obtaining unsupervised observations, it is desirable to limit the amount of supervision needed.

Therefore, we augment the unsupervised ΔVAE with a supervised method that makes use of transformation-labeled batches, i.e. batches $\{x_m\}_{m=1}^M$ such that $x_m = g_m \cdot x_1$ for $m = 2, \ldots, M$, where the transformations $g_m$ (and thus their group representations $\rho(g_m)$) are known and are referred to as transformation labels. The simplified version of the metric from Equation (4) can then be used for each batch as an additional loss term (with $x_0 = x_1$), as it is differentiable under the assumptions described above (using the Euclidean norm).

We make a small adjustment to Equation (4) for the purpose of our method, since the mean computed there does not typically lie on the latent manifold $Z_G$. Thus, we use the projection $\Pi$ from the ΔVAE to project the mean onto $Z_G$. Writing the encodings as $z_m := h(x_m)$, the additional loss term for a transformation-labeled batch $\{x_m\}_{m=1}^M$ becomes

$$\mathcal{L}_{\mathrm{LSBD}} = \frac{1}{M} \sum_{m=1}^M \left\| \rho(g_m^{-1}) \cdot z_m - \Pi\left( \frac{1}{M} \sum_{m'=1}^M \rho(g_{m'}^{-1}) \cdot z_{m'} \right) \right\|^2, \tag{11}$$

where $g_1 = e$, the group identity.

Moreover, instead of feeding the encodings $z_m$ to the decoder, we use $\rho(g_m) \cdot \bar{z}$, where $\bar{z} = \Pi\left( \frac{1}{M} \sum_{m=1}^M \rho(g_m^{-1}) \cdot z_m \right)$. This encourages the decoder to follow the required group structure. This only affects the reconstruction loss component of the ΔVAE.

$^5$This projection function is not defined for $z = 0$, but this value does not occur in practice.

Figure 3: Example images from each of the datasets used, including (c) Airplane, (d) ModelNet40, and (e) COIL-100. Each row shows different examples from a single factor changing.

Figure 4: Overview of the supervised part of LSBD-VAE.

Fig. 4 illustrates the supervised part of our method for a transformation-labeled batch $\{x_m\}_{m=1}^M$. The loss function is the regular ELBO (but with adjusted decoder input as described above) as used in the ΔVAE, plus an additional term $\gamma \cdot \mathcal{L}_{\mathrm{LSBD}}$, where $\gamma$ is a weight hyperparameter to control the influence of the supervised loss component. By alternating unsupervised and supervised training (using the same encoder and decoder), we have a method that makes use of both unlabeled and transformation-labeled observations.
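The supervised quantities for one transformation-labeled batch can be sketched as follows for a single $SO(2)$ factor with latent manifold $S^1$. This is a NumPy illustration only: an actual training loop would express the same computation in an automatic-differentiation framework, and the function name `supervised_terms` is hypothetical.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix for angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def supervised_terms(angles, z):
    """Supervised loss and adjusted decoder inputs for one labeled batch.

    angles: shape (M,), angle of g_m with x_m = g_m . x_1 (so angles[0] = 0)
    z:      shape (M, 2), encodings z_m = h(x_m) on the unit circle
    """
    # rho(g_m^{-1}) . z_m: rotate every encoding back to the first one.
    aligned = np.stack([rot(-a) @ zm for a, zm in zip(angles, z)])
    mean = aligned.mean(axis=0)
    z_bar = mean / np.linalg.norm(mean)      # Pi: project the mean onto S^1
    loss = np.mean(np.sum((aligned - z_bar) ** 2, axis=1))
    # Decoder inputs rho(g_m) . z_bar replace the raw encodings z_m.
    decoder_inputs = np.stack([rot(a) @ z_bar for a in angles])
    return loss, z_bar, decoder_inputs

angles = np.array([0.0, 0.5, 1.0])
z = np.stack([np.cos(0.2 + angles), np.sin(0.2 + angles)], axis=1)
loss, z_bar, dec_in = supervised_terms(angles, z)
```

For the perfectly equivariant encodings in this example, the loss vanishes and the adjusted decoder inputs coincide with the original encodings; the total objective would add $\gamma$ times this loss to the ELBO.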

6. Experimental Setup

Data. We evaluate the disentanglement of several models on three different image datasets (Square, Arrow, and Airplane) with a known group decomposition $G = SO(2) \times SO(2)$ describing the underlying transformations. For each subgroup, a fixed number of $|\mathcal{G}_k| = 64$ transformations, $k \in \{1, 2\}$, is selected. The datasets exemplify different group actions of SO(2): periodic translations, in-plane rotations, out-of-plane rotations, and periodic hue-shifts.

In real settings, not all variability in the data can be modelled by the actions of a group. Therefore, we also evaluate the same models on two datasets, ModelNet40 (Wu et al., 2014) and COIL-100 (Nene et al., 1996), that consist of images from various objects (i.e. non-symmetric variation) under known out-of-plane rotations (SO(2) symmetries). In many settings it is easy to obtain labels for such rotations, e.g. when the camera or object angle is controlled by an agent. See Fig. 3 for examples of the datasets. For more details, see Appendix E.

Note that we do not evaluate our LSBD-VAE method and $\mathcal{D}_{\mathrm{LSBD}}$ metric on traditional disentanglement datasets as evaluated by Locatello et al. (2019), since these datasets lack a clear underlying group structure. However, our results on the ModelNet40 and COIL-100 datasets show that our method can disentangle properties with a group structure from properties without such a structure.

LSBD-VAE with Semi-Supervised Labelled Pairs. For the Square, Arrow, and Airplane datasets we test LSBD-VAE with transformation-labeled batches of size $M = 2$. More specifically, for each experiment we randomly select $L$ disjoint pairs of data points, and label the transformation between the data points in each pair. We vary the number of labeled pairs $L$ from 0 (corresponding to a ΔVAE) to $N/2$ (in which case each data point is involved in exactly one labeled pair). We set the weight $\gamma$ of the supervised loss component to $\gamma = 100$ for all experiments. We choose $M = 2$ for our experiments since it is the most limited setting for LSBD-VAE. Higher values of $M$ would provide stronger supervision, so successful results with $M = 2$ imply that good results can also be achieved for higher values of $M$ (but not necessarily vice versa).
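Selecting disjoint labeled pairs can be sketched as below. The sketch assumes the dataset is indexed by a 64 x 64 grid of angle indices, one axis per SO(2) factor, and that a transformation label is the pair of angle differences between the two points; the function name `sample_labelled_pairs` and the indexing scheme are hypothetical.

```python
import numpy as np

def sample_labelled_pairs(n_angles, num_pairs, seed=0):
    """Pick `num_pairs` disjoint pairs from an n_angles x n_angles grid dataset.

    A data point (i, j) corresponds to the angles 2*pi*i/n_angles and
    2*pi*j/n_angles. Returns the grid indices of each pair, shape
    (num_pairs, 2, 2), and the label of the transformation mapping the first
    point of a pair to the second, as two angles, shape (num_pairs, 2).
    """
    rng = np.random.default_rng(seed)
    n_points = n_angles * n_angles
    # A permutation without replacement guarantees that the pairs are disjoint.
    chosen = rng.permutation(n_points)[: 2 * num_pairs].reshape(num_pairs, 2)
    idx = np.stack(np.unravel_index(chosen, (n_angles, n_angles)), axis=-1)
    step = 2 * np.pi / n_angles
    labels = ((idx[:, 1] - idx[:, 0]) % n_angles) * step
    return idx, labels

idx, labels = sample_labelled_pairs(n_angles=64, num_pairs=256)
```

With $N = 64 \cdot 64 = 4096$ data points, the largest admissible value is `num_pairs = 2048`, which corresponds to the $N/2$ setting described above.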

For the COIL-100 and ModelNet40 datasets, we train LSBD-VAE on batches containing images of one particular object from all different angles (72 and 64 for COIL-100 and ModelNet40, respectively). Each batch is labelled with transformations $(g_1, e), \ldots, (g_M, e)$, where $g_m$ represent rotations, and the unit transformation $e$ indicates that the object is unchanged. To represent the rotations we use an $S^1$ latent space as in the ΔVAE, whereas for the object identity we use a 5-dimensional Euclidean space with standard Gaussian prior as in regular VAEs. LSBD is measured as the disentanglement of rotations in the latent space. For these experiments we used $\gamma = 1$.

Figure 5: $\mathcal{D}_{\mathrm{LSBD}}$ scores for all methods on all datasets. (a) Datasets with $SO(2) \times SO(2)$ symmetries (Arrow, Airplane, Square). (b) Datasets with $SO(2)$ and non-symmetric variation (COIL-100, ModelNet40).

LSBD-VAE with Paths of Consecutive Observations. It is often cheap to obtain transformation labels in settings where we can apply simple transformations and observe their effects, such as an agent navigating its environment. By registering actions (e.g. rotate left over a given angle) and the resulting observations, we can construct a path of consecutive views with known in-between transformations. We can then use these paths to train an LSBD-VAE.

For the datasets with $G = G_1 \times G_2 = SO(2) \times SO(2)$ (Square, Arrow, Airplane), we generate random paths by consecutively applying one randomly chosen transformation from $\{g_1, g_1^{-1}, g_2, g_2^{-1}\}$ where $g_k \in G_k$ for $k \in \{1, 2\}$, starting from randomly chosen observations. In our experiments, we generate 50 paths of length 100, and $g_k$ corresponds to an SO(2) transformation with an angle of $\frac{3}{64} 2\pi$ radians. Example paths can be found in Fig. 8 in the Appendix.
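Such paths can be generated with a short random walk over grid indices. The sketch assumes the dataset is indexed by a 64 x 64 grid of angle indices, so that a step of $\frac{3}{64} 2\pi$ radians is a shift of 3 indices, and it interprets "length 100" as 100 observations per path; both the function name `random_path` and that interpretation are assumptions.

```python
import numpy as np

def random_path(n_angles=64, length=100, step=3, seed=0):
    """Random walk on the n_angles x n_angles grid of an SO(2) x SO(2) dataset.

    Every move applies one of {g_1, g_1^{-1}, g_2, g_2^{-1}}, where g_k shifts
    the k-th angle index by `step`. Returns the visited grid indices, shape
    (length, 2), and the applied index shifts, shape (length - 1, 2).
    """
    rng = np.random.default_rng(seed)
    moves = np.array([[step, 0], [-step, 0], [0, step], [0, -step]])
    start = rng.integers(0, n_angles, size=2)            # random first observation
    chosen = moves[rng.integers(0, 4, size=length - 1)]  # one move per transition
    path = np.vstack([start, (start + np.cumsum(chosen, axis=0)) % n_angles])
    return path, chosen

path, moves_taken = random_path()
paths = [random_path(seed=s)[0] for s in range(50)]      # 50 paths, as in the text
```

Consecutive observations along a path, together with the known move between them, form the transformation-labeled batches used for the supervised loss.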

For the COIL-100 and Model Net40 datasets there is only one group to disentangle. Therefore, similar random walks are not very meaningful here, and we do not evaluate them for these datasets.

Other Disentanglement Methods. We furthermore test a number of known disentanglement methods for comparison, including traditional disentanglement methods as well as methods focusing on LSBD. In particular, we use disentanglement_lib (Locatello et al., 2019) to train a regular VAE (Kingma & Welling, 2014; Rezende et al., 2014), β-VAE (Higgins et al., 2017), CC-VAE (Burgess et al., 2018), FactorVAE (Kim & Mnih, 2018), and DIP-VAE-I/II (Kumar et al., 2018). We also include two weakly-supervised models, Ada-GVAE and Ada-ML-VAE (Locatello et al., 2020), which are trained on pairs of data with few changing factors, to test whether this kind of supervision is helpful for LSBD. Furthermore we evaluate the method from Quessard et al. (2020) that focuses on LSBD. We also tested ForwardVAE (Caselles-Dupré et al., 2019), but show only limited results since we were not able to reproduce any reasonable results for our datasets.

Most of these methods have no notion of an underlying group structure, and thus do not give a fully fair comparison with our LSBD-VAE method. However, we emphasize that the main goal of our experiments is to investigate properties of disentangled representations from both the traditional and the LSBD perspective.

Disentanglement Metrics. We use encodings from all methods to evaluate $\mathcal{D}_{\mathrm{LSBD}}$, as well as common traditional disentanglement metrics from disentanglement_lib: Beta (Higgins et al., 2017), Factor (Kim & Mnih, 2018), SAP (Kumar et al., 2018), DCI Disentanglement (Eastwood & Williams, 2018), Mutual Information Gap (MIG) (Chen et al., 2018), and Modularity (MOD) (Ridgeway & Mozer, 2018).

Further Details. More information about the architectures, epochs and hyperparameters can be found in Appendix F. For the traditional disentanglement methods trained on the Square, Arrow and Airplane datasets, the latent spaces have 4 dimensions, since this is the minimum number of dimensions necessary to learn LSBD representations for an underlying $SO(2) \times SO(2)$ symmetry group, see Higgins et al. (2018) and Caselles-Dupré et al. (2019). For COIL-100 and ModelNet40 we use latent spaces with 7 dimensions for a fair comparison with the LSBD-VAE method.

7. Results: Evaluating LSBD with $\mathcal{D}_{\mathrm{LSBD}}$

We now highlight three key observations from our experimental results. In particular, we differentiate between the methods (VAE, β-VAE, CC-VAE, FACTOR, DIP-I, DIP-II) and metrics (BETA, FACTOR, SAP, DCI, MIG, MOD) that approach disentanglement in the traditional sense, and methods (ΔVAE, QUESSARD, LSBD-VAE) and metric ($\mathcal{D}_{\mathrm{LSBD}}$) that focus specifically on LSBD. The full quantitative results can be found in Appendix H. Further qualitative results can be found in Appendix G.

7.1. Traditional Disentanglement Methods Don't Learn LSBD Representations

Fig. 5 summarizes the $\mathcal{D}_{\mathrm{LSBD}}$ scores (lower is better) for all methods on all datasets. Bars show the mean scores over 10 runs for each method, and the vertical lines represent standard deviations. LSBD-VAE/L indicates our method trained on L labelled pairs (LSBD-VAE/0 corresponds to the unsupervised ΔVAE), LSBD-VAE/full indicates our method where all images are involved in exactly one labelled pair, and LSBD-VAE/paths indicates our method trained with paths of consecutive observations. Note that LSBD-VAE obtained very good scores (near 0) on the Arrow and Square datasets, hence the missing bars.

None of the traditional disentanglement methods achieve good $\mathcal{D}_{\mathrm{LSBD}}$ scores, even if they score well on other traditional disentanglement metrics. This implies that LSBD isn't achieved by traditional methods. Moreover, from the full results in Appendix H we see that the traditional methods on these datasets do not achieve good scores on all traditional metrics. In particular, SAP, DCI, and MIG scores are low. We believe this is a result of the cyclic nature of the symmetries underlying our datasets, further emphasizing the need for disentanglement methods that can capture such symmetries.

The SAP and MIG scores measure to what extent generative factors are disentangled into a single latent dimension. However, since the factors in our datasets are inherently cyclic due to their symmetry structure, they cannot be properly represented in a single latent dimension, as shown by Perez Rey et al. (2020). Instead, at least two dimensions are needed to continuously represent each cyclic factor in our data. A similar conclusion was made by Caselles-Dupré et al. (2019) and Painter et al. (2020).
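A small numerical illustration of this point: the sketch below compares one particular one-dimensional encoding of a rotation (its angle in $[0, 2\pi)$) with the two-dimensional encoding on the unit circle, for two nearly identical rotations on either side of the wrap-around. It only illustrates the discontinuity for this specific encoding; the general statement is the topological result cited above.

```python
import numpy as np

eps = 1e-3
a, b = eps, 2 * np.pi - eps   # two nearly identical rotations

# One latent dimension: encode the rotation by its angle in [0, 2*pi).
gap_1d = abs(a - b)           # close to 2*pi, a jump at the wrap-around

def circle(t):
    """Two latent dimensions: encode the rotation as a point on the unit circle."""
    return np.array([np.cos(t), np.sin(t)])

gap_2d = np.linalg.norm(circle(a) - circle(b))   # close to 2*eps
```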

DCI disentanglement measures whether a latent dimension captures at most one generative factor. This is accomplished by measuring the importance of each latent dimension in predicting the true generative factor using boosted trees. However, since the generative factors are cyclic, the performance of the boosted tree classifiers is far from optimal, thus providing more importance to several dimensions in predicting the generative factors and giving overall lower DCI scores.

7.2. LSBD-VAE and other LSBD Methods Can Learn LSBD Representations with Limited Supervision on Transformations

From Fig. 5 we observe that methods focusing specifically on LSBD can achieve better (lower) DLSBD scores, showing that they are indeed more suitable for learning LSBD representations. In particular, LSBD-VAE achieved very good DLSBD scores for all datasets. Moreover, our experiments on the Arrow, Airplane, and Square datasets also show that limited supervision suffices to obtain good DLSBD scores with low variability, either with few transformation-labelled pairs or with paths of consecutive observations that are easy to obtain in agent-environment settings.

We only partially managed to reproduce the results from Quessard et al. (2020) on our datasets. Their method scored fairly well on the Airplane, ModelNet40, and COIL-100 datasets, but did not do well on the Square and Arrow datasets in our experiments.

Furthermore, we tested ForwardVAE by Caselles-Dupré et al. (2019), but we could not produce any reasonable results on our datasets. Therefore, we do not include scores for this method. We did manage to reproduce ForwardVAE's results on the Flatland dataset used in the original paper, for which we computed a mean DLSBD score of 0.012 with standard deviation 0.001 over 10 runs, confirming that ForwardVAE indeed learns LSBD representations for Flatland.

7.3. LSBD Representations Also Satisfy Previous Disentanglement Notions

Our results also indicate that LSBD captures various desirable properties that are expressed by traditional disentanglement metrics. In Fig. 6 we compare DLSBD scores with scores for previous disentanglement metrics. Note that for DLSBD lower is better, whereas for all other metrics higher is better. As we noted before, good scores on traditional disentanglement metrics don't necessarily imply good DLSBD scores. Conversely, however, methods that score well on DLSBD also score well on many traditional disentanglement metrics, often even outperforming the traditional methods. In particular, from the full results (see Appendix H) we see that LSBD-VAE matches or outperforms the traditional methods on the BETA, FACTOR and MOD metrics, and achieves much better scores for the DCI metric where traditional methods scored poorly.

The MIG and SAP scores are still low for methods focusing on LSBD. This is expected, however, as explained earlier in Section 7.1. This was also observed by Painter et al. (2020) for different datasets.

Figure 6: Comparing DLSBD to previous disentanglement metrics (Beta, Factor, SAP, DCI, MIG, MOD). (a) Scatter plot of metric value against DLSBD. (b) Metric values binned by DLSBD value.

8. Conclusion

We presented DLSBD, a metric to quantify Linear Symmetry-Based Disentanglement (LSBD) as defined by Higgins et al. (2018). We used this metric formulation to motivate LSBD-VAE, a semi-supervised method to learn LSBD representations given some expert knowledge on the underlying group symmetries that are to be disentangled.

We used DLSBD to evaluate various disentanglement methods, both traditional methods and recent methods that specifically focus on LSBD, and showed that LSBD-VAE can learn LSBD representations where traditional methods fail to do so. We also compared DLSBD to traditional disentanglement metrics, showing that LSBD captures many of the same desirable properties that are expressed by existing disentanglement metrics. Conversely, we also showed that traditional disentanglement methods and metrics do not usually achieve or measure LSBD.

Challenges that remain are expanding and testing LSBD-VAE and DLSBD on different group structures, towards more practical applications, as well as focusing on the utility of LSBD representations for downstream tasks.

9. Acknowledgements

This work has received funding from the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No 737459 (project Productive4.0). This Joint Undertaking receives support from the European Union Horizon 2020 research and innovation program and Germany, Austria, France, Czech Republic, Netherlands, Belgium, Spain, Greece, Sweden, Italy, Ireland, Poland, Hungary, Portugal, Denmark, Finland, Luxembourg, Norway, Turkey.

This work has also received funding from the NWO-TTW Programme Efficient Deep Learning (EDL) P16-25.

References

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.

Caselles-Dupré, H., Ortiz, M. G., and Filliat, D. Symmetry-based disentangled representation learning requires interaction with environments. In Advances in Neural Information Processing Systems, pp. 4606–4615, 2019.

Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2615–2625, 2018.

Cohen, T. and Welling, M. Learning the irreducible representations of commutative Lie groups. 31st International Conference on Machine Learning, pp. 3757–3770, 2014.

Cohen, T. S. and Welling, M. Transformation properties of learned visual representations. In 3rd International Conference on Learning Representations, 2015.

Community, B. O. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2020. URL http://www.blender.org.

Eastwood, C. and Williams, C. K. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Hall, B. C. Lie Groups, Lie Algebras, and Representations, volume 222 of Graduate Texts in Mathematics. Springer International Publishing, Cham, 2015. ISBN 978-3-319-13466-6. doi: 10.1007/978-3-319-13467-3. URL http://link.springer.com/10.1007/978-3-319-13467-3.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

Kim, H. and Mnih, A. Disentangling by factorising. In International Conference on Machine Learning, pp. 2649–2658, 2018.

Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.

Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018.

Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, 2019.

Locatello, F., Poole, B., Raetsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, 2020. URL https://proceedings.mlr.press/v119/locatello20a.html.

Nene, S. A., Nayar, S. K., Murase, H., et al. Columbia object image library (coil-20). 1996.

Painter, M., Prugel-Bennett, A., and Hare, J. Linear disentangled representations and unsupervised action estimation. Advances in Neural Information Processing Systems, 33, 2020.

Perez Rey, L. A., Menkovski, V., and Portegies, J. Diffusion variational autoencoders. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 2704–2710, 2020.

Pfau, D., Higgins, I., Botev, A., and Racanière, S. Disentangling by Subspace Diffusion. pp. 1–21, 2020. URL http://arxiv.org/abs/2006.12982.

Quessard, R., Barrett, T. D., and Clements, W. R. Learning Group Structure and Disentangled Representations of Dynamical Environments. Advances in Neural Information Processing Systems, 33, 2020.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning-Volume 32, 2014.

Ridgeway, K. and Mozer, M. C. Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, pp. 185–194, 2018.

Soatto, S. Steps Towards a Theory of Visual Information: Active Perception, Signal-to-Symbol Conversion and the Interplay Between Sensing and Control. arXiv preprint arXiv:1110.2053, 2011.

Sosnovik, I., Szmaja, M., and Smeulders, A. Scale Equivariant Steerable Networks. International Conference on Learning Representations, pp. 1–14, 2019.

TensorFlow Datasets. TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets, 2021.

Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5028–5037, 2017.

Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. 3D ShapeNets: A Deep Representation for Volumetric Shapes. 2014. ISSN 10636919. doi: 10.1109/CVPR.2015.7298801. URL http://arxiv.org/abs/1406.5670.

A. Preliminaries: Group Theory

In this appendix, we summarize some concepts from group theory that are important to understand the main text of the paper. Group theory provides a useful language to formalize the notion of symmetry transformations and their effects. For a more elaborate discussion we refer the reader to the book from Hall (2015) on group theory.

Group. A group is a non-empty set $G$ together with a binary operation $\circ : G \times G \to G$ that satisfies three properties:

1. Associativity: For all $f, g, h \in G$, it holds that $f \circ (g \circ h) = (f \circ g) \circ h$.

2. Identity: There exists a unique element $e \in G$ such that for all $g \in G$ it holds that $e \circ g = g \circ e = g$.

3. Inverse: For all $g \in G$ there exists an element $g^{-1} \in G$ such that $g^{-1} \circ g = g \circ g^{-1} = e$.
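As a concrete instance of the three properties, the following sketch checks them by brute force for the cyclic group of integers modulo n under addition, which can be seen as a discretization of SO(2). The choice n = 6 and the names `op`, `associative`, `identities`, and `has_inverses` are illustrative assumptions.

```python
from itertools import product

n = 6          # illustrative group order
G = range(n)   # elements 0, ..., n-1


def op(a, b):
    """Group operation: addition modulo n."""
    return (a + b) % n


# 1. Associativity: f o (g o h) == (f o g) o h for all triples.
associative = all(
    op(f, op(g, h)) == op(op(f, g), h) for f, g, h in product(G, repeat=3)
)

# 2. Identity: collect all e with e o g == g o e == g for every g.
identities = [e for e in G if all(op(e, g) == g == op(g, e) for g in G)]

# 3. Inverse: every g has some k with k o g == g o k == e.
e = identities[0]
has_inverses = all(any(op(k, g) == e == op(g, k) for k in G) for g in G)

print(associative, identities, has_inverses)
```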

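Direct product. Let $G$ and $G'$ be two groups. The direct product, denoted by $G \times G'$, is the group with elements $(g, g') \in G \times G'$ with $g \in G$ and $g' \in G'$, and the binary operation $\circ : (G \times G') \times (G \times G') \to G \times G'$ such that $(g, g') \circ (h, h') = (g \circ h, g' \circ h')$.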

Lie group. A Lie group is a group $G$ that is also a smooth manifold. This means that it can be described locally by a set of continuous parameters and that one can interpolate continuously between elements of $G$.

Group action. Let $A$ be a set and $G$ a group. The group action of $G$ on $A$ is a function $G_A : G \times A \to A$ that has the following properties:$^6$

1. $G_A(e, a) = a$ for all $a \in A$.

2. $G_A(g, G_A(g', a)) = G_A(g \circ g', a)$ for all $g, g' \in G$ and $a \in A$.

Regular action. The action of $G$ on $A$ is regular if for every pair of elements $a, a' \in A$ there exists a unique $g \in G$ such that $g \cdot a = a'$.
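The difference between a general action and a regular one can be made concrete with a small sketch. It assumes the cyclic group of integers modulo n = 6 acting on the set {0, ..., m-1} by shifting modulo m; the function names are hypothetical. For m = 6 the action is regular, while for m = 3 it is still an action but each target is reached by two group elements.

```python
n = 6  # order of the cyclic group Z_n (illustrative choice)


def act(g, a, m):
    """Action of g in Z_n on a in {0, ..., m-1}: shift modulo m (m must divide n)."""
    return (a + g) % m


def is_action(m):
    """Check the two group action properties by brute force."""
    G, A = range(n), range(m)
    identity_ok = all(act(0, a, m) == a for a in A)
    compatible = all(
        act(g, act(k, a, m), m) == act((g + k) % n, a, m)
        for g in G for k in G for a in A
    )
    return identity_ok and compatible


def is_regular(m):
    """Regular: for every pair (a, b) exactly one g maps a to b."""
    G, A = range(n), range(m)
    return all(sum(act(g, a, m) == b for g in G) == 1 for a in A for b in A)


print(is_action(6), is_regular(6))  # action on 6 points
print(is_action(3), is_regular(3))  # action on 3 points
```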

Group representation. A group representation of $G$ in the vector space $V$ is a function $\rho : G \to GL(V)$ (where $GL(V)$ is the general linear group on $V$) such that for all $g, g' \in G$, $\rho(g \circ g') = \rho(g) \circ \rho(g')$ and $\rho(e) = I_V$, where $I_V$ is the identity matrix.

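Direct sum of representations. The direct sum of two representations $\rho_1 : G \to GL(V)$ in $V$ and $\rho_2 : G \to GL(V')$ in $V'$ is a group representation $\rho_1 \oplus \rho_2 : G \to GL(V \oplus V')$ over the direct sum $V \oplus V'$, defined for $v \in V$ and $v' \in V'$ as:

$$(\rho_1 \oplus \rho_2)(g) \cdot (v, v') = (\rho_1(g) \cdot v, \; \rho_2(g) \cdot v') \tag{12}$$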
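In matrix form the direct sum is a block-diagonal matrix. The sketch below illustrates Eq. (12) for G = SO(2) with two rotation representations of frequencies 1 and 2; these frequencies, the angles, and the names `rot` and `rho` are assumptions chosen for illustration only.

```python
import numpy as np


def rot(theta, freq=1):
    """2x2 rotation by freq * theta: a representation of SO(2) on R^2."""
    c, s = np.cos(freq * theta), np.sin(freq * theta)
    return np.array([[c, -s], [s, c]])


def rho(theta):
    """Direct sum rho_1 (+) rho_2 as a block-diagonal 4x4 matrix."""
    out = np.zeros((4, 4))
    out[:2, :2] = rot(theta, freq=1)
    out[2:, 2:] = rot(theta, freq=2)
    return out


g, g_prime = 0.7, 2.1                       # two group elements (angles)
v, v_prime = np.array([1.0, 2.0]), np.array([-0.5, 0.3])

# Left- and right-hand side of Eq. (12).
lhs = rho(g) @ np.concatenate([v, v_prime])
rhs = np.concatenate([rot(g, 1) @ v, rot(g, 2) @ v_prime])

print(np.allclose(lhs, rhs))
print(np.allclose(rho(g + g_prime), rho(g) @ rho(g_prime)))  # homomorphism property
```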

B. Linear Symmetry-Based Disentanglement: Definition with respect to World States

Higgins et al. (2018) provide a formal definition of linear disentanglement that connects symmetry transformations affecting the real world (from which data is generated) to the internal representations of a model. In the main text, we provide a definition from the perspective of a group action on the data directly, but the original definition considers an extra conceptual world state as well. Here, we describe the original setting in more detail, and explain why we choose a more direct and practical version of the definition.

The definition assumes the following setting. $W$ is the set of possible world states, with underlying symmetry transformations that are described by a group $G$ and its action $\cdot : G \times W \to W$ on $W$. In particular, $G$ can be decomposed as the direct product of $K$ groups $G = G_1 \times \ldots \times G_K$. Data is obtained via an observation function $b : W \to X$ that maps world states to observations in a data space $X$. A model's internal representation of data is modeled with the encoding function $h : X \to Z$ that maps data to the embedding space $Z$. Together, the observation and the encoding constitute the model's internal representation of the real world $f : W \to Z$ with $f(w) = h \circ b(w)$. The definition for Linearly Symmetry-Based Disentangled (LSBD) representations then formalizes the requirement that a model's internal representation $f$ should reflect and disentangle the transformation properties of the real world, and that the transformation properties of the model's internal representations should be linear.

$^6$To avoid notational clutter, we write $G_A(g, a) = g \cdot a$ where the set $A$ on which $g \in G$ acts can be inferred from the context.

The original definition considers $G$ acting on $W$ and involves the model's internal representation $f : W \to Z$, but since we do not directly observe $W$ it is more practical to evaluate LSBD with respect to the encoding map $h : X \to Z$ instead. If the action of $G$ on $W$ is regular$^7$ and the observation map $b : W \to X$ is injective$^8$, we can instead define LSBD with respect to the action of $G$ on $X$ and the encoding map $h$, as shown in the main text.

C. Inner Product

To describe the norm $\|\cdot\|_{\rho,h,\mu}$ used in the definition of DLSBD we start with an arbitrary inner product $(\cdot, \cdot)$ on the linear latent space $Z$. Assume that $\rho$ is linearly disentangled and accordingly splits into irreducible representations $\rho_k : G \to GL(Z_k)$ where $Z = Z_1 \oplus \ldots \oplus Z_K$ for some $K \in \mathbb{N}$. We will define a new inner product $\langle \cdot, \cdot \rangle_{\rho,h,\mu}$ on $Z$ as follows. First of all, we declare $Z_k$ and $Z_m$ to be orthogonal with respect to $\langle \cdot, \cdot \rangle_{\rho,h,\mu}$ if $k \neq m$. We denote by $\pi_k$ the orthogonal projection on $Z_k$.

For $z, z' \in Z_k$, we set

$$\langle z, z' \rangle_{\rho,h,\mu} := \lambda_{k,h,\mu}^{-1} \int_{g \in G} (\rho(g) \cdot z, \rho(g) \cdot z') \, dm(g) \tag{13}$$

where $m$ is the (bi-invariant) Haar measure normalized such that $m(G) = 1$, and we set

$$\lambda_{k,h,\mu} := \int_X \int_G \|\pi_k(h(x))\|^2 \, dm(g) \, d\mu(x) \tag{14}$$

if the integral on the right-hand side is strictly positive, and otherwise we set $\lambda_{k,h,\mu} := 1$. This construction completely specifies the new inner product, and it has the following properties:

• The subspaces $Z_k$ are mutually orthogonal.

• $\rho_k(g)$ is orthogonal on $Z_k$ for every $g \in G$; in other words, $\rho_k$ maps to the orthogonal group on $Z_k$. Moreover, $\rho$ maps to the orthogonal group on $Z$. This follows directly from the bi-invariance of the Haar measure and the definition of $\langle \cdot, \cdot \rangle_{\rho,h,\mu}$.

• If $\pi_k$ is the orthogonal projection to $Z_k$, then

$$\int_X \|\pi_k(h(x))\|^2_{\rho,h,\mu} \, d\mu(x) = 1 \tag{15}$$

if the integral on the left is strictly positive.

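For an arbitrary pair $z, z' \in Z$ the inner product $\langle \cdot, \cdot \rangle_{\rho,h,\mu}$ is given by

$$\langle z, z' \rangle_{\rho,h,\mu} = \sum_{k=1}^{K} \lambda_{k,h,\mu}^{-1} \int_{g \in G} (\rho(g) \cdot \pi_k(z), \rho(g) \cdot \pi_k(z')) \, dm(g) \tag{16}$$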
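The group-averaging step can be illustrated numerically. The sketch below is a simplified stand-in, not our implementation: it assumes a single subspace (K = 1), replaces SO(2) by its finite subgroup of 64 rotations so that the normalized Haar integral becomes a plain average, uses the dot product as the arbitrary starting inner product, and uses random vectors in place of the encodings h(x). The normalization divides by the average squared norm of the encodings under the averaged inner product, in the spirit of Eq. (15). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
angles = 2 * np.pi * np.arange(n) / n


def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


# A representation that is NOT orthogonal for the dot product:
# a rotation expressed in a skewed basis.
A = np.array([[2.0, 1.0], [0.0, 0.5]])
A_inv = np.linalg.inv(A)
rho = [A @ rot(t) @ A_inv for t in angles]

# Gram matrix of the group-averaged inner product:
# <z, z'> = mean over g of (rho(g) z) . (rho(g) z').
M = np.mean([r.T @ r for r in rho], axis=0)

# Normalisation with toy encodings standing in for h(x).
encodings = rng.normal(size=(100, 2))
lam = np.mean(np.einsum("ni,ij,nj->n", encodings, M, encodings))
M_normalised = M / lam

# rho(g) preserves the averaged inner product: rho(g)^T M rho(g) == M.
print(all(np.allclose(r.T @ M @ r, M) for r in rho))
```

After averaging, every ρ(g) is orthogonal with respect to the new inner product even though it is not orthogonal for the dot product, which is the property used in Appendix D.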

D. Evaluation of Equivariance by DLSBD

We will now give an alternative expression for the disentanglement metric DLSBD, since it will more visibly relate to the definition of equivariance. To avoid notational clutter, in this section we will denote the norm $\|\cdot\|_{\rho,h,\mu}$ as $\|\cdot\|$. Let $\rho \in P(G, Z)$ be a linear disentangled representation of $G$ in $Z$. By expanding the inner product (or by using the usual computation rules for expectations and variances), we first find that

$$
\begin{aligned}
&\int_G \Big\| \rho(g)^{-1} \cdot h(g \cdot x_0) - \int_G \rho(g')^{-1} \cdot h(g' \cdot x_0) \, d\nu(g') \Big\|^2 d\nu(g) \\
&\quad = \int_G \big\| \rho(g)^{-1} \cdot h(g \cdot x_0) \big\|^2 \, d\nu(g) - \Big\| \int_G \rho(g)^{-1} \cdot h(g \cdot x_0) \, d\nu(g) \Big\|^2 \\
&\quad = \frac{1}{2} \int_G \int_G \big\| \rho(g)^{-1} \cdot h(g \cdot x_0) - \rho(g')^{-1} \cdot h(g' \cdot x_0) \big\|^2 \, d\nu(g) \, d\nu(g').
\end{aligned}
\tag{17}
$$

$^7$This assumption holds in most practical cases with a suitable description of $G$.

$^8$This is typically the case, but if not it can be solved through active sensing, see Soatto (2011).

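We now use that $\rho$ maps to the orthogonal group with respect to $\|\cdot\|$, so that we can write the same expression as

$$\frac{1}{2} \int_G \int_G \big\| \rho(g \circ g'^{-1})^{-1} \cdot h(((g \circ g'^{-1}) \circ g') \cdot x_0) - h(g' \cdot x_0) \big\|^2 \, d\nu(g) \, d\nu(g'). \tag{18}$$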

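This brings us to the alternative characterization of DLSBD as

$$D_{\mathrm{LSBD}} = \inf_{\rho \in P(G,Z)} \frac{1}{2} \int_G \int_G \big\| \rho(g \circ g'^{-1})^{-1} \cdot h(((g \circ g'^{-1}) \circ g') \cdot x_0) - h(g' \cdot x_0) \big\|^2 \, d\nu(g) \, d\nu(g'). \tag{19}$$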
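The derivation above rests on the standard identity that a variance equals both the mean squared norm minus the squared norm of the mean and half the mean squared distance between two independent copies. A minimal numerical check, assuming random vectors as stand-ins for the terms ρ(g)⁻¹·h(g·x₀) and a uniform empirical measure in place of ν, could look as follows; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for rho(g)^{-1} h(g x0), one row per sampled group element g.
samples = rng.normal(size=(200, 4))
mean = samples.mean(axis=0)

# Dispersion around the mean (first expression of the identity).
var_form = np.mean(np.sum((samples - mean) ** 2, axis=1))

# Mean squared norm minus squared norm of the mean (second expression).
moment_form = np.mean(np.sum(samples ** 2, axis=1)) - np.sum(mean ** 2)

# Half the mean squared distance over all pairs (g, g') (third expression).
diffs = samples[:, None, :] - samples[None, :, :]
pairwise_form = 0.5 * np.mean(np.sum(diffs ** 2, axis=2))

print(var_form, moment_form, pairwise_form)
```

The three printed values agree up to floating-point error, because the pairwise average is taken over all ordered pairs of the same empirical measure.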

In particular, if for every data point $x$ there is a unique group element $g_x$ such that $x = g_x \cdot x_0$, the disentanglement metric DLSBD can also be written as

$$\inf_{\rho \in P(G,Z)} \frac{1}{2} \int_G \int_X \big\| \rho(g \circ g_x^{-1})^{-1} \cdot h((g \circ g_x^{-1}) \cdot x) - h(x) \big\|^2 \, d\nu(g) \, d\mu(x), \tag{20}$$

in which the equivariance condition appears prominently. The condition becomes even more apparent if $\nu$ is in fact the Haar measure itself, in which case the metric equals

$$\inf_{\rho \in P(G,Z)} \frac{1}{2} \int_G \int_X \big\| \rho(g)^{-1} \cdot h(g \cdot x) - h(x) \big\|^2 \, dm(g) \, d\mu(x). \tag{21}$$
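To make the role of the equivariance condition tangible, the sketch below evaluates a discretized version of the integral in the last expression for a toy setting. It is an illustration under several assumptions, not our practical implementation: SO(2) is replaced by 64 equally spaced angles, data points are identified with angles, the toy encoder `h` embeds each data point on the unit circle (so the Euclidean norm already satisfies the normalization of Appendix C), and the infimum is not computed; only two candidate representations are compared.

```python
import numpy as np

n = 64
angles = 2 * np.pi * np.arange(n) / n  # discretised group elements and data points


def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def h(theta):
    """Toy encoder: the data point at angle theta is mapped onto the unit circle."""
    return np.array([np.cos(theta), np.sin(theta)])


def equivariance_error(freq):
    """Discrete version of the integral for rho(g) = rot(freq * g)."""
    total = 0.0
    for g in angles:
        rho_inv = rot(-freq * g)  # rho(g)^{-1}
        for x in angles:
            total += np.sum((rho_inv @ h(g + x) - h(x)) ** 2)
    return 0.5 * total / (n * n)


print(equivariance_error(1))  # representation matching the encoder
print(equivariance_error(2))  # representation with a mismatched frequency
```

The matching representation gives a value of essentially zero, while the mismatched one gives a strictly positive value, so the expression separates equivariant from non-equivariant pairs of encoder and representation.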

E. Datasets

All datasets contain 64 × 64 pixel images. The Square, Arrow and Airplane datasets have a known group decomposition $G = SO(2) \times SO(2)$ describing the underlying transformations. In these three datasets, for each subgroup $G_k$ with $k \in \{1, 2\}$ a fixed number of $|G_k| = 64$ transformations is selected. Each image is generated from a single initial data point upon which all possible group actions are applied, resulting in datasets with $|G_1| \cdot |G_2| = 4096$ images. The datasets exemplify different group actions of SO(2): periodic translations, in-plane rotations, out-of-plane rotations, and periodic hue-shifts, see Fig. 7.

The ModelNet40 and the COIL-100 datasets consist of different objects rotating with respect to a vertical axis (out-of-plane rotation). For these datasets the group $G = SO(2)$ describes the underlying transformations that each object undergoes, see Fig. 7. The different objects can be seen as non-symmetric variability in the data. In this particular case, each object has its own base-point $x_0$ from which data is generated. The metric DLSBD is then evaluated per object instance for the group $G = SO(2)$, and the resulting values are averaged across all available objects. Fig. 8 shows some example paths of consecutive observations for the Square, Arrow, and Airplane datasets, as explained in Sect. 6.

Square. This dataset consists of a set of images of a black background with a square of 16 × 16 white pixels. The dataset is generated by applying vertical and horizontal translations of the white square with periodic boundaries.
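A dataset with this structure can be sketched in a few lines. The snippet below is an approximation rather than our generation code: it assumes that the 64 translations per direction correspond to shifts of one pixel each, and it places the base square in the top-left corner, neither of which is specified above.

```python
import numpy as np

size, square, n_shifts = 64, 16, 64

# Base image: black background with a white 16 x 16 square (corner placement assumed).
base = np.zeros((size, size), dtype=np.float32)
base[:square, :square] = 1.0

step = size // n_shifts  # assumed: one pixel per group element

# Apply every combination of vertical and horizontal periodic translations.
dataset = np.stack([
    np.roll(base, shift=(i * step, j * step), axis=(0, 1))
    for i in range(n_shifts)
    for j in range(n_shifts)
])

print(dataset.shape)  # number of images and image size
```

Because `np.roll` wraps around the image borders, the square re-enters on the opposite side, which realizes the periodic boundaries.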

Arrow This dataset consists of a set of images depicting a colored arrow at a given orientation. The dataset is generated by applying cyclic shifts of its color and in-plane rotations. The cyclic color shifts were obtained by preselecting a fixed set of 64 colors from a circular hue axis. The in-plane rotations were obtained by rotating the arrow along an axis perpendicular to the picture plane over 64 predefined positions.

Airplane. This dataset consists of renders obtained using Blender v2.7 (Community, 2020) from a 3D model of an airplane within the ModelNet40 dataset (Wu et al., 2014) (this dataset is provided for the convenience of academic research only). We created each image by varying two properties: the airplane's color and its orientation with respect to the render camera. The orientation was changed via rotation with respect to a vertical axis (out-of-plane rotation). The colors of the model were selected from a predefined cyclic set of colors, similar to the Arrow dataset.

Figure 7: Example images from each of the datasets used, including panels (c) Airplane, (d) ModelNet40, and (e) COIL-100. Each image corresponds to an example data point for a combination of two factors, e.g. color and orientation. The factors change horizontally and vertically. The boundaries for the Square, Arrow and Airplane datasets are periodic. For the ModelNet40 and COIL-100 datasets, the vertical direction represents different object instances and the horizontal direction represents the rotation of the corresponding object.

ModelNet40. This dataset also consists of renders obtained using Blender v2.7 (Community, 2020), from the 626 training 3D models within the airplane category of the ModelNet40 dataset (Wu et al., 2014). We created each image by varying each airplane's orientation with respect to the render camera, via rotation with respect to a vertical axis (out-of-plane rotation). In this case we used 64 orientations for each object, i.e. $|G| = 64$, for a total of 626 objects, thus the dataset consists of 40,064 images.

COIL-100. This dataset (Nene et al., 1996) consists of images from 100 objects placed on a turntable against a black background. For each object, 72 views of the rotated object are provided. The original images have a resolution of 128 × 128 and were re-scaled to 64 × 64 to match our other datasets. In this case for each object $|G| = 72$, thus the total dataset consists of 7200 images. This dataset is intended for non-commercial research purposes only. This dataset was obtained using TensorFlow Datasets (2021).

F. Experimental Settings and Hyperparameters

F.1. Architectures

Table 1 shows the encoder and decoder architectures used for almost all methods and datasets. The encoder's last layer depends on the method. For VAE, cc-VAE, FactorVAE, DIP-I, and DIP-II, two dense layers with 4 units each were used. For LSBD-VAE and VAE two dense layers with 4 and 2 units each were used. For Quessard a single dense layer with 4 units was used. The only model that was not trained with this architecture was LSBD-VAE/0 on the ModelNet40 dataset: during training the loss produced NaN values, so the architecture of Table 2 was used instead.

Figure 8: Example paths of consecutive observations. Panel (c): Airplane.

Table 1: Encoder and decoder architectures used in most methods.

Encoder:
Input, size (64, 64, number of channels)
Conv, filters 32, kernel 4, stride 2, ReLU
Conv, filters 32, kernel 4, stride 2, ReLU
Conv, filters 64, kernel 4, stride 2, ReLU
Conv, filters 64, kernel 4, stride 2, ReLU
Dense, units 256, ReLU
Dense (×2), units depend on method

Decoder:
Input, size (number of latent dimensions)
Dense, units 256, ReLU
Dense, units 4*4*64, ReLU
Reshape (4, 4, 64)
ConvT, filters 64, kernel 4, stride 2, ReLU
ConvT, filters 32, kernel 4, stride 2, ReLU
ConvT, filters 32, kernel 4, stride 2, ReLU
ConvT, filters (number of channels), kernel 4, stride 2, sigmoid
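The spatial sizes implied by Table 1 can be traced with a short calculation. The sketch assumes "same" padding, which the table does not state; under that assumption each stride-2 convolution halves the spatial size and each stride-2 transposed convolution doubles it, which is consistent with the (4, 4, 64) reshape in the decoder. The helper names are hypothetical.

```python
import math


def conv_out(size, stride):
    """Spatial output size of a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)


def convt_out(size, stride):
    """Spatial output size of a strided transposed convolution with 'same' padding (assumed)."""
    return size * stride


# Encoder: four stride-2 convolutions starting from 64 x 64 inputs.
encoder_sizes = [64]
for _ in range(4):
    encoder_sizes.append(conv_out(encoder_sizes[-1], stride=2))

# Decoder: four stride-2 transposed convolutions starting from the 4 x 4 reshape.
decoder_sizes = [4]
for _ in range(4):
    decoder_sizes.append(convt_out(decoder_sizes[-1], stride=2))

# Number of features entering the first dense layer of the encoder (64 filters).
flattened = encoder_sizes[-1] ** 2 * 64

print(encoder_sizes, decoder_sizes, flattened)
```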

F.2. Hyperparameters

Table 3 shows the hyperparameters used to train each model for all datasets. Table 4 shows the hyperparameters used to train the LSBD-VAE models for each dataset. In the latter case, the number of epochs for the LSBD-VAE model was increased. The range of values used for the scale parameter t was increased for the ModelNet40 and COIL-100 datasets, since this provided better results in terms of data reconstruction and disentanglement. For the Arrow dataset, a value of γ = 1 produced unstable results. However, the values 10, 100, 1000, or even 10000 produced good results without significant differences among them. Therefore the value 100 was used for the datasets with the same structure (Square, Arrow and Airplane). For ModelNet40 and COIL-100, the experiments showed that values of this hyperparameter as high as 10000 could affect the reconstructions, thus a lower value γ = 1 was chosen.

Table 2: Encoder and decoder architecture used to train LSBD-VAE/0 for the ModelNet40 dataset.

Encoder:
Input, size (64, 64, number of channels)
Dense, units 512, ReLU, batch normalization
Dense, units 256, ReLU, batch normalization
Dense (×2), units depend on method

Decoder:
Input, size (number of latent dimensions)
Dense, units 256, ReLU, batch normalization
Dense, units 512, ReLU, batch normalization
Dense, units 64*64*(number of channels), sigmoid
Reshape (64, 64, number of channels)

The training of the weakly-supervised models AdaGVAE and AdaMLVAE was done with a data generator that organized the available training data into pairs. The only condition introduced in Locatello et al. (2020) to train these models was to provide paired data with few factors changing among them. For our datasets, two factors change.

Table 3: Model hyperparameters for all datasets

Model | Parameters
VAE | training steps 30000
β-VAE | β = 5, training steps 30000
cc-VAE | β = 5, γ = 1000, c_max = 15, iteration threshold 3500, training steps 30000
Factor | γ = 1, epochs 30000
DIP-I | λ_od = 1, λ_d = 10, training steps 30000
DIP-II | λ_od = 1, λ_d = 1, training steps 30000
AdaGVAE | β = 1, epochs 500
AdaMLVAE | β = 1, epochs 500
Quessard | λ = 0.01, trajectories 3000

Table 4: LSBD-VAE hyperparameters for all datasets

Datasets | Parameters
Square, Arrow, Airplane | $t \in [10^{-10}, 10^{-9}]$, γ = 100.0, epochs 1500
ModelNet40 | $t \in [10^{-10}, 10^{-5}]$, γ = 1.0, epochs 1500
COIL-100 | $t \in [10^{-10}, 10^{-5}]$, γ = 1.0, epochs 6000

F.3. Hardware & Running Time

The hardware used across all experiments was a DGX station with 4 NVIDIA V100 GPUs (32GB). Only one GPU was used per experiment. The running times for LSBD-VAE across all 9 degrees of supervision $L \in \{0, 256, 512, 768, 1024, 1280, 1536, 1792, 2048\}$ and all 10 runs (9 × 10 repetitions in total) were: Arrow 33 ± 4 minutes, Airplane 29 ± 4 minutes, and Square 28 ± 4 minutes. The running time for LSBD-VAE across 2 degrees of supervision and 10 runs (2 × 10 repetitions in total) was 136 ± 10 minutes for ModelNet40 and 90 ± 6 minutes for COIL-100. For the method from Quessard et al. (2020) the training times were approximately 30 minutes across all datasets. The training times for the methods from disentanglement_lib (Locatello et al., 2019) were not measured.

F.4. Code Licenses

The disentanglement_lib (Locatello et al., 2019) code is released under an Apache 2.0 License, while the code used to reproduce the method by Quessard et al. (2020) is released under an MIT license.

G. Qualitative Results

G.1. Data Generation

Inspecting data generated by a model can help understand the structure of the learnt latent space in a qualitative way. Fig. 9 shows generated data obtained by sampling and decoding ten latent variables for each of the models trained on the COIL-100 and ModelNet40 airplanes datasets. Each latent variable is sampled from the prior over the latent space and decoded to produce an image.

In general, all models but one produce similar results consisting of objects with unclear shape or identity. The exception is the weakly-supervised AdaGVAE model trained on COIL-100, which appears to have a degenerate decoder producing only yellow objects. Such behaviour occurs for all ten trained instances of the AdaGVAE model.

Even though the randomly generated images seem to have no clear identity or shape for COIL-100, LSBD-VAE makes it easier to determine the identity of such samples by showing multiple orientations, thanks to the structure of its latent space. LSBD-VAE uses a latent space combining an $S^1$ manifold encouraged to encode information about the SO(2) rotations and a Euclidean latent space encouraged to represent the information about the object's identity.

By first sampling a latent variable from the Euclidean latent space and combining it with a set of regularly spaced latent variables along $S^1$ we can observe some consistency in object identity, see Fig. 10. Such data generation cannot be directly obtained from traditional disentanglement methods since there is no clear direction representing either the object identities or the orientations.
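The sampling procedure just described can be sketched as follows. Several details are assumptions made for illustration: the Euclidean dimension `D`, a standard normal prior on the Euclidean part, the representation of points on $S^1$ as unit vectors in the plane, and the ordering of the two latent parts. The trained decoder is not included; the sketch only constructs the latent codes that would be passed to it.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8           # hypothetical dimension of the Euclidean latent space
n_orient = 12   # number of regularly spaced orientations on S^1

# One identity code, sampled from an assumed standard normal prior.
identity = rng.normal(size=D)

# Regularly spaced points on the circle, represented as unit vectors in R^2.
angles = 2 * np.pi * np.arange(n_orient) / n_orient
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Combine each orientation with the same identity code: shape (n_orient, 2 + D).
latents = np.concatenate([circle, np.tile(identity, (n_orient, 1))], axis=1)

# images = decoder(latents)  # 'decoder' would be the trained LSBD-VAE decoder
print(latents.shape)
```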

G.2. Object Interpolation

Next, we will show how the latent space is structured among the latent variables representing the objects' identities for models trained with COIL-100. We show the generated data obtained from decoding linearly interpolated latent variables between different objects to show the transitions between objects and orientations.

For LSBD-VAE the interpolation is simple. First, the latent variables associated with the identity of the start and end objects are estimated by averaging the Euclidean latent variables of all images per object. Second, the linear interpolation between the object identity latent variables of the start and end object is calculated to generate a path through the object identity space. Finally, the estimated identity variables in the path are combined with regularly spaced variables in the orientation space $S^1$ and decoded. See Fig. 12 (a).
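The three steps can be sketched as below. The snippet uses random vectors as stand-ins for the Euclidean latent codes of two objects (72 views each, as in COIL-100) and assumes that points on $S^1$ are represented as unit vectors in the plane; the dimension 8 and the function names are hypothetical, and the decoder is omitted.

```python
import numpy as np


def identity_code(euclidean_codes):
    """Step 1: average the Euclidean latent codes of all images of one object."""
    return euclidean_codes.mean(axis=0)


def interpolation_grid(start, end, n_steps, n_orient):
    """Steps 2 and 3: latent codes of shape (n_steps, n_orient, 2 + D)."""
    ts = np.linspace(0.0, 1.0, n_steps)[:, None]
    path = (1.0 - ts) * start + ts * end                          # (n_steps, D)
    angles = 2 * np.pi * np.arange(n_orient) / n_orient
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (n_orient, 2)
    circle_part = np.broadcast_to(circle[None, :, :], (n_steps, n_orient, 2))
    identity_part = np.broadcast_to(
        path[:, None, :], (n_steps, n_orient, path.shape[1])
    )
    return np.concatenate([circle_part, identity_part], axis=2)


rng = np.random.default_rng(0)
codes_a = rng.normal(loc=-1.0, size=(72, 8))  # stand-in codes for the start object
codes_b = rng.normal(loc=1.0, size=(72, 8))   # stand-in codes for the end object

grid = interpolation_grid(
    identity_code(codes_a), identity_code(codes_b), n_steps=5, n_orient=12
)
print(grid.shape)
```

Each entry of `grid` would then be decoded to an image, with one axis showing the transition between objects and the other showing orientations.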

In the case of the traditional disentanglement methods we cannot produce a latent variable representing an object's identity, so there is no clear traversal between objects. In this case, a linear interpolation between the latent variables of an image of the start object and an image of the end object is calculated, and the interpolated latent variables are decoded, see Fig. 12 (b). Notice that we cannot easily produce an image of an object with an arbitrary orientation, since we do not know the shape of the loop in the latent space representing an object.

Fig. 11 shows the generated images obtained by interpolating between two objects. We show only cc-VAE as a representative of the traditional models, since that method attained the lowest DLSBD among them. A particularly interesting interpolation is between the wooden object and the orange cat figure. The interpolation of cc-VAE passes through a green object in between, while LSBD-VAE shows a consistent transition between the objects. A visual explanation of this observation is presented in Fig. 12 (b).

Figure 9: Images obtained by decoding latent variables sampled according to the prior over the latent space for different models trained on the (a) COIL-100 and (b) ModelNet40 airplanes datasets.

H. Full results

The full results for all experiments on all datasets are given in Tables 5, 6, 7, 8, and 9. We report the mean and standard deviation over 10 runs for each experiment.

H.1. Limited Supervision Suffices to Learn LSBD Representations

The results obtained from Tables 5, 6, 7 show that we do not need transformation-labels for all data points, only a subset of labeled pairs is sufficient to learn LSBD representations. To further highlight this, Fig. 13 shows DLSBD scores for LSBD-VAE trained on the Square, Arrow, and Airplane datasets respectively, for various values for the number of labeled pairs L. For each L and each dataset, we trained 10 models so we can report box plots of the DLSBD scores.

For low values of L we see worse scores and high variability. But for slightly higher L, scores are consistently good, starting already at L = 512 for the Square, L = 768 for the Arrow, and L = 256 for the Airplane. This corresponds to respectively 25%, 37.5%, and 12.5% of the data being involved in a labeled pair. Moreover, we see that with just a little supervision we outperform the best traditional method on DLSBD. Overall, these results suggest that with some expert knowledge (about the underlying group and a suitable representation) and limited annotation of transformations, LSBD can be achieved.
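The percentages above follow from simple arithmetic, sketched here under the assumption that every image takes part in at most one labeled pair, so that L pairs involve 2L of the 4096 images. The function name is hypothetical.

```python
n_images = 64 * 64  # images in the Square, Arrow and Airplane datasets


def fraction_in_labeled_pair(num_pairs):
    """Fraction of images involved in a labeled pair, assuming disjoint pairs."""
    return 2 * num_pairs / n_images


fractions = {
    "Square": fraction_in_labeled_pair(512),
    "Arrow": fraction_in_labeled_pair(768),
    "Airplane": fraction_in_labeled_pair(256),
}
print(fractions)
```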

H.2. Quessard Arrow

In the main text we mentioned that we did not reproduce good results with the method of Quessard et al. (2020) on the Arrow and Square datasets. We highlight a particular case for the Arrow dataset, where the method clearly learns the rotations of the arrow but fails to learn color. Fig. 14 shows reconstructed Arrow images. Since color isn't learned well, this example doesn't get a good DLSBD score, even though rotation is properly linearly disentangled.

Figure 10: Image generation by traversing the circular latent variable for a sampled object identity. The high-dimensional Euclidean space is depicted as a single dimension in a hyper-cylinder. (a) Latent space structure: the latent variable corresponding to the identity is sampled from the prior over the Euclidean latent space and combined with regularly spaced latent variables on $S^1$. (b) Generated data: each row presents the decoded images for a fixed Euclidean latent variable, while each column shows the images for a fixed latent variable on $S^1$. The images are obtained by decoding the latent variables with LSBD-VAE/full trained on the COIL-100 dataset.
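
The latent grid described in this caption can be sketched as follows. This is an illustration only, not the authors' implementation: it assumes a standard normal prior over the Euclidean part and represents points on the circle as unit vectors in the plane. The function name `cylinder_latent_grid` and its parameters are hypothetical, and the decoder is omitted.

```python
import numpy as np


def cylinder_latent_grid(num_identities, num_angles, euclid_dim, seed=0):
    """Build a (num_identities, num_angles, 2 + euclid_dim) grid of latent codes.

    Row i shares one Euclidean (identity) code; column j shares one point on the circle.
    """
    rng = np.random.default_rng(seed)
    # Assumption: the prior over the Euclidean latent space is standard normal.
    identities = rng.standard_normal((num_identities, euclid_dim))
    # Regularly spaced angles on the circle, embedded as unit vectors (cos, sin).
    angles = np.linspace(0.0, 2.0 * np.pi, num_angles, endpoint=False)
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # Broadcast both parts to the full grid and concatenate along the last axis.
    circle_grid = np.broadcast_to(circle[None, :, :], (num_identities, num_angles, 2))
    ident_grid = np.broadcast_to(
        identities[:, None, :], (num_identities, num_angles, euclid_dim)
    )
    return np.concatenate([circle_grid, ident_grid], axis=-1)


grid = cylinder_latent_grid(num_identities=3, num_angles=8, euclid_dim=5)
print(grid.shape)
```

Passing each entry of the grid through a trained decoder would then give one image per (identity, angle) combination, arranged as in panel (b).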

(b) LSBD-VAE

Figure 11: Images produced by decoding interpolated latent variables using cc-VAE and LSBD-VAE trained on COIL-100. Three interpolations between two objects are shown. Each column represents the transitions between objects, while each row shows images that should correspond to different orientations.

Figure 12: Diagrams illustrating the interpolation between the latent variables associated with two objects. (a) Interpolation across a hyper-cylinder within $Z = S^1 \times \mathbb{R}^D$, used by LSBD-VAE. (b) Interpolation across $Z = \mathbb{R}^D$, used by traditional disentanglement models. In traditional disentanglement models, the linear interpolation can cross the latent codes associated with unexpected objects.
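
One possible way to realize the two interpolation schemes in this figure is sketched below. It is an assumption rather than the procedure used in the experiments: on the hyper-cylinder, the angle is interpolated along the shortest arc of the circle while the Euclidean part is interpolated linearly, so every intermediate code stays on $S^1 \times \mathbb{R}^D$. The function names are hypothetical.

```python
import numpy as np


def interpolate_cylinder(angle_a, z_a, angle_b, z_b, num_steps):
    """Interpolate on S^1 x R^D: shortest arc on the circle, straight line in R^D."""
    t = np.linspace(0.0, 1.0, num_steps)
    # Signed shortest angular difference, wrapped to [-pi, pi).
    delta = (angle_b - angle_a + np.pi) % (2.0 * np.pi) - np.pi
    angles = angle_a + t * delta
    circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    euclid = (1.0 - t)[:, None] * z_a[None, :] + t[:, None] * z_b[None, :]
    return circle, euclid


def interpolate_euclidean(z_a, z_b, num_steps):
    """Plain linear interpolation in R^D, as in panel (b)."""
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - t) * z_a[None, :] + t * z_b[None, :]


# Example: the circular part keeps unit norm at every interpolation step.
circle, euclid = interpolate_cylinder(0.0, np.zeros(3), np.pi / 2, np.ones(3), num_steps=5)
print(np.linalg.norm(circle, axis=1))
```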

Figure 13: Box plots of DLSBD scores over 10 training repetitions for different numbers of labeled pairs L, for all datasets. The red line indicates the best-performing traditional disentanglement method. Panels: Square dataset, Arrow dataset, Airplane dataset; horizontal axis: labeled pairs L.

(b) Reconstructions

Figure 14: Results from the method of Quessard et al. (2020) on the Arrow dataset.

Table 5: Scores for the Square dataset.

| MODEL | BETA | FACTOR | SAP | DCI | MIG | MOD | DLSBD |
|---|---|---|---|---|---|---|---|
| VAE | .945 ± .061 | .835 ± .140 | .019 ± .004 | .009 ± .005 | .013 ± .004 | .579 ± .202 | .634 ± .440 |
| β-VAE | .980 ± .033 | .913 ± .095 | .021 ± .006 | .017 ± .011 | .021 ± .014 | .642 ± .147 | .732 ± .488 |
| CC-VAE | .508 ± .023 | .000 ± .000 | .003 ± .002 | .007 ± .002 | .014 ± .004 | .222 ± .110 | 1.905 ± .023 |
| FACTOR | .974 ± .048 | .910 ± .104 | .020 ± .003 | .019 ± .017 | .017 ± .010 | .712 ± .183 | .667 ± .428 |
| DIP-I | .972 ± .042 | .861 ± .097 | .020 ± .005 | .010 ± .002 | .011 ± .002 | .618 ± .117 | 1.109 ± .312 |
| DIP-II | .930 ± .119 | .848 ± .137 | .018 ± .004 | .010 ± .004 | .015 ± .007 | .607 ± .207 | .907 ± .559 |
| ADAGVAE | .841 ± .230 | .707 ± .386 | .009 ± .009 | .024 ± .015 | .012 ± .005 | .473 ± .185 | .666 ± .378 |
| ADAMLVAE | .737 ± .208 | .465 ± .403 | .008 ± .008 | .016 ± .006 | .013 ± .007 | .338 ± .128 | 1.063 ± .387 |
| QUESSARD | .504 ± .021 | .000 ± .000 | .004 ± .003 | .007 ± .004 | .018 ± .008 | .354 ± .213 | 1.686 ± .294 |
| LSBD-VAE/0 | .970 ± .079 | .913 ± .121 | .018 ± .003 | .052 ± .052 | .018 ± .004 | .884 ± .183 | .749 ± .554 |
| LSBD-VAE/256 | 1.000 ± .000 | 1.000 ± .001 | .021 ± .004 | .267 ± .152 | .027 ± .007 | .986 ± .023 | .104 ± .147 |
| LSBD-VAE/512 | 1.000 ± .000 | 1.000 ± .000 | .021 ± .006 | .393 ± .022 | .025 ± .005 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/768 | 1.000 ± .000 | 1.000 ± .000 | .019 ± .004 | .387 ± .014 | .025 ± .004 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1024 | 1.000 ± .000 | 1.000 ± .000 | .022 ± .005 | .398 ± .020 | .024 ± .003 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1280 | 1.000 ± .000 | 1.000 ± .000 | .023 ± .003 | .389 ± .016 | .023 ± .003 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1536 | 1.000 ± .000 | 1.000 ± .000 | .022 ± .004 | .398 ± .013 | .027 ± .002 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1792 | 1.000 ± .000 | 1.000 ± .000 | .020 ± .004 | .397 ± .016 | .027 ± .005 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/FULL | 1.000 ± .000 | 1.000 ± .000 | .021 ± .006 | .380 ± .027 | .027 ± .005 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/PATHS | - | - | - | - | - | - | .005 ± .002 |

Table 6: Scores for the Arrow dataset.

| MODEL | BETA | FACTOR | SAP | DCI | MIG | MOD | DLSBD |
|---|---|---|---|---|---|---|---|
| VAE | 1.000 ± .000 | .646 ± .032 | .017 ± .004 | .009 ± .003 | .013 ± .004 | .961 ± .012 | 1.316 ± .193 |
| β-VAE | .999 ± .002 | .588 ± .045 | .018 ± .004 | .008 ± .002 | .015 ± .005 | .898 ± .032 | 1.178 ± .065 |
| CC-VAE | .982 ± .056 | .707 ± .102 | .019 ± .004 | .011 ± .005 | .016 ± .004 | .980 ± .038 | 1.013 ± .096 |
| FACTOR | 1.000 ± .000 | .659 ± .028 | .017 ± .003 | .008 ± .003 | .014 ± .002 | .935 ± .037 | 1.526 ± .125 |
| DIP-I | 1.000 ± .000 | .624 ± .042 | .020 ± .004 | .008 ± .002 | .012 ± .003 | .967 ± .027 | 1.521 ± .113 |
| DIP-II | 1.000 ± .000 | .644 ± .064 | .020 ± .004 | .009 ± .003 | .013 ± .004 | .973 ± .011 | 1.616 ± .102 |
| ADAGVAE | 1.000 ± .000 | .656 ± .137 | .016 ± .005 | .020 ± .009 | .009 ± .004 | .973 ± .042 | 1.620 ± .147 |
| ADAMLVAE | .997 ± .008 | .706 ± .168 | .017 ± .007 | .019 ± .009 | .011 ± .004 | .943 ± .111 | 1.395 ± .117 |
| QUESSARD | 1.000 ± .000 | .596 ± .032 | .016 ± .006 | .008 ± .004 | .017 ± .008 | .999 ± .000 | 1.183 ± .412 |
| LSBD-VAE/0 | 1.000 ± .001 | .664 ± .105 | .016 ± .002 | .009 ± .004 | .019 ± .005 | .897 ± .108 | 1.627 ± .104 |
| LSBD-VAE/256 | 1.000 ± .000 | .662 ± .046 | .017 ± .005 | .009 ± .004 | .020 ± .005 | .963 ± .010 | 1.475 ± .121 |
| LSBD-VAE/512 | 1.000 ± .000 | .956 ± .119 | .021 ± .006 | .297 ± .157 | .023 ± .003 | .967 ± .092 | .245 ± .474 |
| LSBD-VAE/768 | 1.000 ± .000 | 1.000 ± .000 | .022 ± .006 | .390 ± .022 | .026 ± .003 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1024 | 1.000 ± .000 | 1.000 ± .000 | .022 ± .003 | .396 ± .026 | .026 ± .006 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1280 | 1.000 ± .000 | 1.000 ± .000 | .019 ± .005 | .401 ± .018 | .026 ± .004 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1536 | 1.000 ± .000 | 1.000 ± .000 | .019 ± .005 | .397 ± .017 | .026 ± .007 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/1792 | 1.000 ± .000 | 1.000 ± .000 | .020 ± .004 | .399 ± .018 | .026 ± .004 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/FULL | 1.000 ± .000 | 1.000 ± .000 | .020 ± .006 | .444 ± .186 | .027 ± .004 | .999 ± .000 | .000 ± .000 |
| LSBD-VAE/PATHS | - | - | - | - | - | - | .016 ± .006 |

Table 7: Scores for the Airplane dataset.

| MODEL | BETA | FACTOR | SAP | DCI | MIG | MOD | DLSBD |
|---|---|---|---|---|---|---|---|
| VAE | 1.000 ± .001 | .947 ± .054 | .023 ± .005 | .013 ± .005 | .020 ± .017 | .801 ± .045 | 1.342 ± .084 |
| β-VAE | 1.000 ± .001 | .997 ± .005 | .018 ± .005 | .036 ± .012 | .028 ± .012 | .816 ± .104 | 1.481 ± .129 |
| CC-VAE | .858 ± .194 | .646 ± .353 | .010 ± .006 | .021 ± .011 | .018 ± .009 | .969 ± .034 | 1.481 ± .174 |
| FACTOR | 1.000 ± .000 | .984 ± .015 | .020 ± .003 | .021 ± .008 | .026 ± .013 | .810 ± .040 | 1.382 ± .171 |
| DIP-I | 1.000 ± .000 | .994 ± .008 | .022 ± .004 | .029 ± .012 | .026 ± .012 | .842 ± .073 | 1.289 ± .150 |
| DIP-II | .998 ± .005 | .972 ± .031 | .021 ± .004 | .022 ± .013 | .030 ± .019 | .780 ± .054 | 1.367 ± .129 |
| ADAGVAE | .962 ± .120 | .892 ± .314 | .013 ± .009 | .026 ± .016 | .010 ± .008 | .733 ± .264 | 1.029 ± .288 |
| ADAMLVAE | 1.000 ± .000 | .995 ± .007 | .019 ± .009 | .035 ± .011 | .017 ± .009 | .861 ± .073 | .994 ± .275 |
| QUESSARD | .999 ± .003 | .987 ± .026 | .018 ± .007 | .016 ± .009 | .018 ± .005 | .795 ± .107 | .558 ± .239 |
| LSBD-VAE/0 | .536 ± .065 | .000 ± .000 | .002 ± .001 | .007 ± .004 | .005 ± .003 | .956 ± .046 | 1.165 ± .180 |
| LSBD-VAE/256 | 1.000 ± .000 | 1.000 ± .000 | .022 ± .006 | .144 ± .011 | .023 ± .004 | .870 ± .039 | .153 ± .021 |
| LSBD-VAE/512 | 1.000 ± .000 | 1.000 ± .000 | .023 ± .008 | .151 ± .015 | .020 ± .004 | .846 ± .032 | .168 ± .022 |
| LSBD-VAE/768 | 1.000 ± .000 | 1.000 ± .000 | .022 ± .004 | .140 ± .014 | .022 ± .005 | .832 ± .034 | .180 ± .030 |
| LSBD-VAE/1024 | 1.000 ± .000 | 1.000 ± .000 | .020 ± .005 | .160 ± .015 | .022 ± .005 | .859 ± .032 | .165 ± .021 |
| LSBD-VAE/1280 | 1.000 ± .000 | 1.000 ± .000 | .024 ± .004 | .153 ± .013 | .022 ± .003 | .876 ± .016 | .151 ± .015 |
| LSBD-VAE/1536 | 1.000 ± .000 | 1.000 ± .000 | .021 ± .005 | .160 ± .016 | .022 ± .004 | .896 ± .025 | .140 ± .018 |
| LSBD-VAE/1792 | 1.000 ± .000 | 1.000 ± .000 | .022 ± .005 | .163 ± .022 | .023 ± .003 | .904 ± .016 | .138 ± .010 |
| LSBD-VAE/FULL | 1.000 ± .000 | 1.000 ± .000 | .016 ± .008 | .161 ± .024 | .021 ± .006 | .913 ± .018 | .132 ± .009 |
| LSBD-VAE/PATHS | - | - | - | - | - | - | .185 ± .017 |

Table 8: Scores for the ModelNet40 airplanes dataset.

| MODEL | BETA | FACTOR | SAP | DCI | MIG | MOD | DLSBD |
|---|---|---|---|---|---|---|---|
| VAE | .995 ± .004 | .838 ± .030 | .013 ± .002 | .013 ± .002 | .009 ± .002 | .415 ± .058 | .393 ± .110 |
| β-VAE | .995 ± .005 | .857 ± .045 | .012 ± .003 | .015 ± .003 | .009 ± .002 | .447 ± .067 | .285 ± .045 |
| CC-VAE | .997 ± .003 | .818 ± .093 | .011 ± .003 | .017 ± .004 | .011 ± .003 | .567 ± .063 | .281 ± .191 |
| FACTOR | .996 ± .004 | .856 ± .052 | .012 ± .002 | .014 ± .003 | .010 ± .003 | .444 ± .077 | .388 ± .096 |
| DIP-I | .988 ± .009 | .783 ± .070 | .012 ± .002 | .013 ± .002 | .008 ± .001 | .343 ± .082 | .416 ± .142 |
| DIP-II | .994 ± .006 | .832 ± .042 | .013 ± .003 | .014 ± .003 | .011 ± .002 | .433 ± .080 | .379 ± .130 |
| ADAGVAE | .996 ± .006 | .775 ± .079 | .010 ± .006 | .014 ± .006 | .013 ± .004 | .421 ± .092 | .476 ± .218 |
| ADAMLVAE | .996 ± .006 | .784 ± .055 | .012 ± .006 | .014 ± .005 | .014 ± .004 | .445 ± .040 | .580 ± .141 |
| QUESSARD | .907 ± .192 | .727 ± .384 | .010 ± .005 | .015 ± .007 | .009 ± .004 | .563 ± .108 | .134 ± .294 |
| LSBD-VAE/0 | .990 ± .009 | .863 ± .038 | .011 ± .003 | .015 ± .003 | .014 ± .003 | .538 ± .103 | .731 ± .068 |
| LSBD-VAE/FULL | 1.000 ± .000 | .990 ± .004 | .012 ± .005 | .052 ± .009 | .020 ± .006 | .947 ± .007 | .041 ± .007 |

Table 9: Scores for the COIL-100 dataset.

| MODEL | BETA | FACTOR | SAP | DCI | MIG | MOD | DLSBD |
|---|---|---|---|---|---|---|---|
| VAE | 1.000 ± .000 | .674 ± .049 | .014 ± .003 | .016 ± .003 | .011 ± .002 | .986 ± .001 | .463 ± .030 |
| β-VAE | 1.000 ± .001 | .740 ± .024 | .015 ± .004 | .014 ± .004 | .013 ± .003 | .982 ± .001 | .579 ± .095 |
| CC-VAE | .999 ± .003 | .723 ± .026 | .013 ± .005 | .014 ± .003 | .013 ± .004 | .985 ± .001 | .406 ± .057 |
| FACTOR | 1.000 ± .001 | .684 ± .041 | .014 ± .002 | .012 ± .002 | .013 ± .004 | .984 ± .001 | .490 ± .024 |
| DIP-I | .999 ± .002 | .631 ± .025 | .013 ± .004 | .012 ± .002 | .010 ± .002 | .986 ± .001 | .525 ± .109 |
| DIP-II | 1.000 ± .001 | .643 ± .043 | .013 ± .003 | .014 ± .002 | .011 ± .002 | .985 ± .001 | .568 ± .079 |
| ADAGVAE | 1.000 ± .000 | .672 ± .021 | .015 ± .007 | .016 ± .005 | .014 ± .006 | .984 ± .001 | .431 ± .049 |
| ADAMLVAE | 1.000 ± .000 | .688 ± .027 | .011 ± .003 | .015 ± .006 | .018 ± .009 | .984 ± .002 | .400 ± .076 |
| QUESSARD | 1.000 ± .000 | .780 ± .044 | .014 ± .004 | .014 ± .002 | .011 ± .003 | .973 ± .004 | .396 ± .055 |
| LSBD-VAE/0 | 1.000 ± .001 | .739 ± .047 | .014 ± .003 | .014 ± .001 | .011 ± .001 | .982 ± .004 | .515 ± .099 |
| LSBD-VAE/FULL | 1.000 ± .000 | .655 ± .028 | .015 ± .004 | .029 ± .003 | .013 ± .003 | .802 ± .056 | .112 ± .026 |