# DUET: 2D Structured and Approximately Equivariant Representations

Xavier Suau 1, Federico Danieli 1, T. Anderson Keller 1 2, Arno Blaas 1, Chen Huang 1, Jason Ramapuram 1, Dan Busbridge 1, Luca Zappella 1

1Apple, 2University of Amsterdam. Correspondence to: Xavier Suau. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Multiview Self-Supervised Learning (MSSL) is based on learning invariances with respect to a set of input transformations. However, invariance partially or totally removes transformation-related information from the representations, which might harm performance for specific downstream tasks that require such information. We propose 2D strUctured and approximately EquivarianT representations (coined DUET), which are 2d representations organized in a matrix structure, and equivariant with respect to transformations acting on the input data. DUET representations maintain information about an input transformation, while remaining semantically expressive. Compared to SimCLR (Chen et al., 2020) (unstructured and invariant) and ESSL (Dangovski et al., 2022) (unstructured and equivariant), the structured and equivariant nature of DUET representations enables controlled generation with lower reconstruction error, while controllability is not possible with SimCLR or ESSL. DUET also achieves higher accuracy for several discriminative tasks, and improves transfer learning.

1. Introduction

The field of representation learning has evolved at a rapid pace in recent years, partially due to the popularity of Multiview Self-Supervised Learning (MSSL) (Chen et al., 2020; He et al., 2019; Caron et al., 2020; Grill et al., 2020; Zbontar et al., 2021). The main idea of MSSL is to learn transformation-invariant representations by comparing data views that underwent different transformations. If the transformations alter only task-irrelevant information, and if representations of multiple views are similar, then those representations should only contain task-relevant information.

Figure 1. DUET. The backbone f yields a 2d representation for each transformed image f(τg(x)) (e.g., τg is a rotation by g degrees). The group marginal is obtained as the softmax (sm) of the sum of the rows, and is compared to the prescribed target (red) with our group loss LG. The content is obtained by summing the columns, and contrasted (LC) with the other view through a projection head h. The final representation for downstream tasks is the 2d one, which has been optimized through its marginals.

However, one can always find a downstream task for which the chosen transformations are relevant. For example, MSSL representations which learn to be color invariant are likely to fail at predicting fruit ripeness, where color information is required (Tian et al., 2020), or at the tasks of generation or segmentation (Kim et al., 2021). One way to maintain information in the representations is to preserve all possible information from the input, as pursued by InfoMax (Linsker, 1988) frameworks. However, it has been shown empirically and theoretically that for tasks like classification, invariance to nuisance information allows for greater data efficiency and downstream performance (Laptev et al., 2016; Tschannen et al., 2019).
In an attempt to simultaneously satisfy the demands of information-rich representations (allowing for generalization to different tasks) and complex invariances (allowing for powerful discriminative representations), modern machine learning research has pursued the concept of structured representations. Colloquially, a representation can be considered structured with respect to a set of transformations if, firstly, the transformation between two inputs can be easily recovered by comparing their representations, and, secondly, there is a known method for recovering the representation which is correspondingly invariant to the transformation set. An example of structured representations are convolutional feature maps, which allow for the spatial position (the translation element) to be easily extracted, while similarly allowing for global translation invariance through spatial pooling. Given the success of structured representations, significant work has gone into expanding the range of transformations for which a structured representation can be recovered, for example, rotation, scaling, and other algebraic group symmetries (Cohen & Welling, 2016; Sosnovik et al., 2020; MacDonald et al., 2022; Jiao & Henriques, 2021; Cotogni & Cusano, 2022).

In the context of MSSL, equivariance has also been successfully used to improve distributional robustness (Dangovski et al., 2022; Lee et al., 2021; Keller et al., 2022). However, to date, this equivariance has largely been encouraged at an informational level rather than a structural level, making the careful disassociation of the equivariant and invariant aspects of the representation challenging or impossible. For example, ESSL (Dangovski et al., 2022) and AugSelf (Lee et al., 2021) make representations sensitive to a transformation by regressing the transformation parameter, making their representations theoretically equivariant, but not interpretably structured, as there is no explicit form of the transformation at representation level. Such a lack of structure makes computing invariances or controlled generation from such representations significantly more challenging.

In this work, we present DUET, a method to learn structured and equivariant representations with MSSL. Instead of learning 1-dimensional representations as in SimCLR (Chen et al., 2020) or ESSL, DUET representations are reshaped to 2d (see Figure 1). This allows for a richer optimization through their row- and column-wise marginals, which are respectively related to the group element (the transformation parameter, e.g., rotation angle) and the content (all the information that is invariant to the transformation actions). In summary, our main contributions are:

- We introduce DUET, a method to incorporate interpretable structure in MSSL representations for both finite and infinite groups with negligible computational overhead1. Our approach also performs well for parameterized transformations that do not satisfy all algebraic group axioms (Serre, 1977), making it widely applicable to most transformations used in MSSL.
- We show empirically that DUET representations become approximately equivariant as a by-product of their predictiveness of a transformation parameter. Importantly, we prescribe an explicit form of transformation at representation level that enables controllable generation, not achievable with ESSL or SimCLR.
- We shed some light on why certain symmetries (e.g., horizontal flips, color transformations) are harder to learn from typical computer vision datasets, due to inherent ambiguity in the data with respect to a transformation. For example, cars appear in both left and right orientations, making it difficult to define what a non-flipped car is.
- We provide extensive experiments on several datasets, comparing with SimCLR and ESSL. We show that DUET representations are suitable for discriminative tasks, transfer learning and controllable generation.

1 Code available at https://github.com/apple/ml-duet

2. Related Work

Structured and Equivariant Representations. In the unsupervised learning domain, existing works like (Stühmer et al., 2020) have extensively explored structured latent priors for the Variational Autoencoder (VAE) (Kingma & Welling, 2014), while the recent Topographic VAE (Keller & Welling, 2021) aims to induce topographic organization with respect to the observed set of transformations. The idea of structured representations has also been connected to unsupervised learning of disentangled representations (Higgins et al., 2017; Kumar et al., 2018). Another closely related line of work focuses on learning equivariance (Cohen & Welling, 2016; Sosnovik et al., 2020; MacDonald et al., 2022; Jiao & Henriques, 2021; Cotogni & Cusano, 2022) as a more general form of structured representations. For example, (Sosnovik et al., 2020) propose to use a basis of transformed filters to learn equivariant features, which generally leads to improved model robustness and data efficiency. NPTN (Pal & Savvides, 2018) follows on (Sosnovik et al., 2020) and proposes to use a completely learnt basis of filters, learning unsupervised invariances.

Structure in MSSL. Modern MSSL is based on discarding task-irrelevant information via image augmentations. Contrastive and non-contrastive approaches achieve this respectively by comparing augmented views of different data (Chen et al., 2020; He et al., 2019; Caron et al., 2020), or by only comparing views from the same datum (Grill et al., 2020; Zbontar et al., 2021). Several authors have explored the comparison of spatially structured representations (Bachman et al., 2019) (exploiting the InfoMax principle (Linsker, 1988)) or using variants of the NCE (Gutmann & Hyvärinen, 2010) loss (Löwe et al., 2019; Oord et al., 2018; Hjelm et al., 2019). Some works have studied the impact objectives have on the distributions of representations (Wang & Isola, 2020), and how these representations may be identifiable with the latent factors of the data generative process (Zimmermann et al., 2021). Recent works have tackled the preservation of information in MSSL representations. ESSL (Dangovski et al., 2022) supplements SimCLR (Chen et al., 2020) by predicting the parameter of a transformation of choice, thereby obtaining theoretically equivariant representations. Similarly, although not focused on equivariance, AugSelf (Lee et al., 2021) predicts the difference in transformation parameters between two views. PCL (Li et al., 2020) adds a reconstruction loss to preserve information about the input. Concurrent work (Huang et al., 2023) disentangles the feature space with masks learned via augmentations.

3. Preliminary Considerations

3.1. Groups and Equivariance

Let f : X → Z be a mapping from data to representations. Such a mapping is equivariant to the algebraic group G = (G, ·) if there exists an input transformation τ : G × X → X (noted τg(x)) and a representation transformation T : G × Z → Z (noted Tg(z)) such that

Tg(f(x)) = f(τg(x)),  ∀g ∈ G, ∀x ∈ X.  (1)

If τg and Tg form algebraic groups in the input and representation spaces respectively, then the mapping f preserves the structure of the input group in the representation space (homomorphism). Recall that for (G, ·) to form a group, the properties of closure, associativity, and existence of neutral and inverse elements must be satisfied (Serre, 1977). Here we consider both finite and infinite groups.

3.2. On MSSL Input Transformations

In MSSL, τg is defined by a parameterized transformation applied to the input. For example, rotation is parameterized by a real angle (g ∈ R, an infinite group). If g ∈ [0, 2π] then it forms a cyclic group. One can also use discrete rotations, which form a finite group where g ∈ {0°, 90°, 180°, 270°}. However, not all input transformations form a group. For example, a change in image contrast moves some pixel values out of range, so clipping is applied, which invalidates the associativity property (e.g., τ2.0(τ0.5(x)) ≠ τ0.5(τ2.0(x)) once values are clipped). We also include Random Resized Crop (RRC) in our study, using the relative cropped width W as a proxy for scale (assuming loss of information about location and aspect ratio). All transformation parameters are mapped to [0, 1] by min-max normalization, as shown in Table 1. Although some transformations do not form a group (e.g., RRC, color transformations), the concept of equivariance is often relaxed to embrace transformations which do not form groups. Note that this relaxation does not invalidate our methodology for exact groups, and helps understand how our method is suitable for non-exact groups.

Table 1. Transformations considered with their associated parameters. Column g shows the corresponding parameters for the group-marginal definition in DUET, and Target shows the recommended target distribution. Note that flips are mapped to {1/4, 3/4}, turning them into cyclic groups.

| Transform. | Finite | Parameter | g | Target |
|---|---|---|---|---|
| Rot. (4-fold) | yes | {0°, 90°, 180°, 270°} | {1/8, 3/8, 5/8, 7/8} | vM |
| Rot. (360) | no | [−180°, 180°] | [0, 1] | vM |
| H. Flip | yes | {0, 1} | {1/4, 3/4} | vM |
| V. Flip | yes | {0, 1} | {1/4, 3/4} | vM |
| Grayscale | yes | {0, 1} | {0, 1} | N |
| Brightness | no | [0.6, 1.4] | [0, 1] | N |
| Contrast | no | [0.6, 1.4] | [0, 1] | N |
| Saturation | no | [0.6, 1.4] | [0, 1] | N |
| Hue | no | [−0.1, 0.1] | [0, 1] | N |
| RRC | no | [0.2W, W] | [0, 1] | N |

4. DUET Representations

In this section we describe how DUET learns representations that are structured with respect to an algebraic group G = (G, ·). The overall DUET architecture is shown in Figure 1. A training input image x is transformed twice by sampling two group elements g1, g2 ∈ G from the same group (e.g., two angles of rotation). We obtain the transformed images xk = τgk(x) with k = 1, 2. Let f be a deep neural network backbone such that zk = f(xk) ∈ R^{C×G}, where C and G are the number of rows and columns in the representation, as shown in Figure 1. This 2-dimensional representation zk models the joint (discretized and unnormalized) distribution P(c, g|xk), where c ∈ R^C and g ∈ G are two random variables defined on the content and group-element domains. Our joint interpretation allows us to marginalize P(c|xk) by summing over the columns of zk, and P(g|xk) by summing over the rows. Rather than imposing a certain dependence (or independence) structure between c and g (conditioned on xk), we only impose our objectives on the marginals P(c|xk) and P(g|xk), and let the model learn such dependencies from the data. Note that a final Batch Normalization (BN) (Ioffe & Szegedy, 2015) layer in f makes the mean of zk approximately equal to β (the bias term in BN); this is important for equivariance, as shown in Section 4.5. Although we focus on a single group, DUET's formulation is suited to handle multiple groups, as discussed in Appendix C, which we leave as future work.
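To make the 2d reshaping and marginalization above concrete, here is a minimal PyTorch sketch (ours, not taken from the released code); the shapes C = 64 and G = 8 follow the training setup in Appendix H.

```python
import torch
import torch.nn.functional as F

def duet_marginals(features: torch.Tensor, C: int = 64, G: int = 8):
    """Reshape flat backbone features to 2d and compute both marginals.

    features: (N, C * G) output of the backbone f (e.g., a 512-d ResNet
              output reshaped to 64 x 8, as in Appendix H).
    Returns the content marginal values (N, C) and the discretized
    group marginal P(g|x) of shape (N, G).
    """
    z = features.view(-1, C, G)      # 2d representation z in R^{C x G}
    content = z.sum(dim=2)           # sum over the group dimension -> P(c|x) values
    mu = z.sum(dim=1)                # column sums {mu_j}
    group = F.softmax(mu, dim=1)     # softmax of the column sums -> P(g|x)
    return content, group
```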
4.1. The Group Marginal Distribution

As we marginalize zk over the content dimension (C), we get {µj}_{j=1}^{G}, the sum of each column in zk. We obtain our discretized group marginal P(g|xk) by applying a softmax to {µj}. Since the parameters gk sampled during training are known, we can design a target distribution for P(g|xk) (the red distribution in Figure 1). We use a von Mises (vM) target q(g|xk) = vM(gk, κ) for cyclic groups, and a Gaussian (N) target q(g|xk) = N(gk, σ) for all other groups. Both targets are chosen because of the simplicity with which we can encapsulate parameter information in their structure (their mean), and the controllability of the uncertainty about g via σ (or κ). For readability, we refer to the uncertainty as σ for both vM and N targets (for the vM target, larger κ corresponds to smaller σ).

To be comparable to P(g|xk), we also discretize our target over [0, 1], obtaining Q(g|xk). Let Ωj be the intervals of a G-sized partition of [0, 1], and gj their centers. Then, the discretized target is obtained by integrating the continuous target according to the partition:

Qj(g|xk) := Q(g = gj | xk) = ∫_{Ωj} q(g|xk) dg / ∫_0^1 q(g|xk) dg.

For the Gaussian target, we accept a slight boundary effect, since we do not integrate the tails that fall outside the partition. We encourage the observed P(g|xk) to match the target by minimizing the Jensen-Shannon divergence (DJS) between the discretized distributions. The group loss for the i-th image x^{(i)} in a batch is

L_G^{(i)} = Σ_{k=1}^{2} DJS( P(g|x_k^{(i)}) ‖ Q(g|x_k^{(i)}) ).  (2)

The choice of σ is key to encourage structure. Both very small and very large values of σ will lead to a loss of structure in the columns of z. For small σ, the target takes a form close to a δ distribution. This results in an invariant discretized target (as δ moves inside an interval Ωj) or abrupt target changes (as δ moves from Ωj to Ωj+1), which prevents learning proper structure. Conversely, for large σ, the target will be close to a uniform distribution, thus removing all information about the group element (all columns contribute equally). In Appendix H.1 we find empirically that σ ≈ 0.2 is optimal in our setting. Note that this value corresponds to a normal distribution N(·, 0.2) that covers the [0, 1] domain within approximately its 3σ span (when centered at 0.5), making it a good trade-off in terms of structure. Interestingly, since our group elements are bounded in [0, 1], the value of σ can be kept constant for all transformation groups and datasets.
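As an illustration of the group loss L_G of this subsection, the following sketch (again ours, and restricted to the Gaussian target used for non-cyclic groups; a vM target would replace the Normal CDF with its circular counterpart) builds the discretized target Q(g|x) and compares it to the observed group marginal with a Jensen-Shannon divergence.

```python
import torch
from torch.distributions import Normal

def gaussian_target(g: torch.Tensor, sigma: float = 0.2, G: int = 8) -> torch.Tensor:
    """Discretized Gaussian target Q(g|x): integrate N(g, sigma) over the G
    bins of a uniform partition of [0, 1], then renormalize (the tails
    falling outside [0, 1] are dropped). g: (N,) parameters in [0, 1]."""
    edges = torch.linspace(0.0, 1.0, G + 1)
    dist = Normal(g.unsqueeze(1), sigma)          # one Gaussian per sample
    cdf = dist.cdf(edges.unsqueeze(0))            # (N, G + 1)
    q = cdf[:, 1:] - cdf[:, :-1]                  # probability mass per bin Omega_j
    return q / q.sum(dim=1, keepdim=True)

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence between rows of p and q (both (N, G))."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.add(eps).log() - b.add(eps).log())).sum(dim=1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def group_loss(group_marginal: torch.Tensor, g: torch.Tensor, sigma: float = 0.2) -> torch.Tensor:
    """L_G for one view: D_JS between the observed group marginal P(g|x)
    (the softmaxed column sums from the sketch above) and the target Q(g|x)."""
    target = gaussian_target(g, sigma, G=group_marginal.shape[1])
    return js_divergence(group_marginal, target).mean()
```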
4.2. The Content Marginal Distribution

As we marginalize zk by summing over the group dimension (G), we obtain P(c|xk), the probability of observing the content c given xk. Such a distribution is invariant to the group actions, and contains all relevant information of xk not related to the group G. For example, the content of an image of a horse is still a horse regardless of its rotation. We maximize the agreement between the content of the two views of an image (x1, x2). Our content representation is defined directly by the values of P(c|xk), noted ck ∈ R^C. Following recent trends in MSSL, both content representations are projected with a network h. We then use the NT-Xent loss (Chen et al., 2020) in the form LC = NT-Xent(h(c1), h(c2)) for a SimCLR-based architecture.

4.3. The DUET Loss

Our final loss for a full batch of N images is

L = (1/N) Σ_{i=1}^{N} ( L_C^{(i)} + λ L_G^{(i)} ).  (3)

LC encourages similarity between the content representations of the two views, which are explicitly made invariant to the group action, as opposed to SimCLR, which contrasts full representations to achieve invariance to the group action. The parameter λ controls how strongly the group structure is imposed.

4.4. Recovering the Transformation Parameter

An interesting property of DUET representations is the ability to recover the transformation parameter of a test image without relying on extra regression heads. This property is useful to transform representations equivariantly (see Section 4.5). It also enables interpretability, since one can analyze the default transformation parameters associated with an image or a dataset. Assuming optimal training of LG, the group marginal will resemble the imposed target. Therefore, the transformation parameter of an arbitrary image xk for a Gaussian target is directly recoverable as ĝ = E[g|xk] ≈ Σ_{j=1}^{G} Pj(g|xk) gj. In practice, for improved robustness, we fit a Gaussian (or vM) function to the values Pj(g|xk) and estimate ĝ as its argmax.

4.5. Equivariance in DUET

Similarly to the approach of ESSL, DUET encourages equivariance by making the neural network sensitive to the transformation parameter g. However, in our method this sensitivity is defined explicitly via Equation (2), such that applying a transformation τg(x) in input space shifts the mean of the group marginal distribution of the corresponding representation z = f(x) by g. In practice, we prescribe a priori a form for the feature-space transformation Tg corresponding to the input-space transformation τg in Equation (1), with the advantage of gaining additional controllability over such transformations (see also Section 5.2). Specifically, we design Tg according to the following considerations. Assuming optimal training of LG, the recovered group marginal distribution P(g|xk) for a given input x transformed by τgk resembles the target Q(g|xk). For Equation (1) to hold, we need to design Tg such that applying the column-sum and softmax operations used to derive P(g|xk) (see Section 4.1) to Tgk(z) also resembles Q(g|xk). We can ensure this by replacing the column sums of z (denoted {µj}_{j=1}^{G}) with values {µ̂j}_{j=1}^{G} that, after applying the softmax, yield Q(g|xk), i.e. µ̂j = softmax⁻¹(Q̂j), with Q̂j = Qj(g|xk) for ease of notation. There are infinitely many solutions for µ̂j, so we choose the µ̂j that satisfy Σj µ̂j = Σj βj. This choice comes from the fact that the final BN layer in f makes the mean of z close to the BN bias terms βj. That is,

µ̂j = ln Q̂j + ln Σ_{j'} e^{µ̂_{j'}},  with  Σj µ̂j = Σj βj.  (4)

The solution to this equation is given by

µ̂j = ln Q̂j + βj − (1/G) Σ_{j'} ln Q̂_{j'}.  (5)

Finally, we define Tg so that it swaps the mean µj with µ̂j:

Tg(z) = z − M + M̂g,  (6)

where all elements of column j of M (resp. M̂g) take the value µj (resp. µ̂j). As such, applying the column-sum and softmax operations to Tgk(z) yields the same values as applying them to zk (at optimality), which is a necessary condition for Equation (1) to hold. Furthermore, defined in this way, Tg satisfies the group axioms (Appendix B; again assuming LG is minimized). In practice, as also happens in other works such as ESSL or the TVAE, we cannot expect Equation (1) to hold always (i.e., for all x and g), as that would require perfect generalization of the learned equivariance. However, for our method we can bound the equivariance generalization error with respect to unseen g (Appendix A), and we demonstrate in Section 5.1 that it is small in practice.

On predictiveness and equivariance. It is key to differentiate between predictiveness and equivariance. While predictiveness implies equivariance, the opposite is not always true (e.g., invariance is a specific case of equivariance that does not imply predictiveness). Therefore, we emphasize that the approximate equivariance in DUET is a by-product of the predictiveness of g at the group marginal level.
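The feature-space transformation of Equations (4) to (6) can be sketched as follows. This is our own illustration: it simplifies the per-column BN biases βj of Equation (5) to a single scalar beta, and it spreads the column-sum change uniformly over the C rows so that softmaxing the new column sums recovers Q̂ exactly.

```python
import torch

def transform_representation(z: torch.Tensor, q_hat: torch.Tensor, beta: float = 0.0) -> torch.Tensor:
    """Feature-space transformation T_g of Equation (6) (sketch).

    z:     (C, G) 2d representation of one image.
    q_hat: (G,) strictly positive discretized target Q(g|x) for the desired
           group element g (see Section 4.1).
    beta:  scalar standing in for the BN bias terms beta_j of Equation (5)
           (a simplifying assumption of this sketch).
    """
    C, G = z.shape
    mu = z.sum(dim=0)                          # current column sums {mu_j}
    log_q = q_hat.log()
    mu_hat = log_q + beta - log_q.mean()       # Equation (5): softmax(mu_hat) == q_hat
    # Equation (6): T_g(z) = z - M + M_hat. We distribute mu_j and mu_hat_j
    # over the C rows of each column so the new column sums are exactly
    # {mu_hat_j}; softmaxing them then recovers q_hat.
    return z - (mu / C).unsqueeze(0) + (mu_hat / C).unsqueeze(0)
```

For the controlled generation of Section 5.2, one can sweep g over [0, 1], build the corresponding targets Q̂ (e.g., with a Gaussian target as in the sketch of Section 4.1), and decode transform_representation(z, q_hat) with a frozen decoder.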
5. Experimental Results

5.1. Empirical Proof of Equivariance

We start with an empirical validation of equivariance by measuring how well Equation (1) holds for real data. To do this, we use the transformation Tg from Equation (6) and compute the representations f(τg1(x)) and Tg2(f(x)) for all g1, g2 ∈ G. An equivariant map should result in a minimal L2 distance ℓg1,g2 = ‖f(τg1(x)) − Tg2(f(x))‖²₂ when g1 = g2. To verify this, we plot the pairwise ℓg1,g2 for all elements g1, g2 and different transformations. More precisely, we sweep 100 values of g1, g2 in [0, 1] for 1000 randomly selected CIFAR-10 (Krizhevsky, 2009) test images, and we show the average pairwise L2 distance in Figure 2. For infinite groups (i.e., color transformations and rotation (360)), there is a strong similarity along the diagonal, validating Equation (1). For finite groups (rotation (4-fold), flips and grayscale) we also see a strong similarity at the observed group elements. For example, rotation (4-fold) shows 4 minima at the observed (normalized) angles. These plots also help to understand how the model generalizes to unseen group elements. Interestingly, equivariance for horizontal flip is only mildly learnt due to its ambiguity in the dataset (see Section 6 for an extended discussion). Indeed, flipped images appear naturally in CIFAR-10 (e.g., cars looking to the right or to the left), and thus there is more ambiguity about the meaning of flipping an image. Vertical flips are nicely learnt, since they do not naturally appear in the data. Another interesting observation is that grayscale yields a constant representation as we reduce the saturation (horizontal axis), and then shows a sudden jump close to 1 (grayscale image). The model has learnt that an image is not grayscale as soon as it presents some hint of color. Note also that grayscale does not form an algebraic group, yet DUET is still able to learn its structure.

5.2. DUET Representations for Group Conditional Generation

In Figure 3 we showcase the benefit of the equivariance of DUET representations for conditioning generation on specific group elements. To this end, we train a decoder on frozen pre-trained DUET representations. In this work we do not aim to obtain state-of-the-art generation quality, but rather use a decoder for visual validation of our hypotheses. Note that group conditional generation is not feasible with MSSL methods like SimCLR or ESSL, since there is no explicit transformation at representation level. Here we exploit the equivariance of DUET representations for controlled generation.
We first obtain the representation of a test image z = f(x) (leftmost images in Figure 3), then we create multiple transformed representations {Tg(z)} using Equation (6), by sweeping g between 0 and 1, and finally we decode all {Tg(z)}. In Figure 3 we show the decoded images for different datasets and groups. Notice how we can recover the input transformation by only transforming the representations, which provides yet an additional visual proof of equivariance in DUET. In Appendix F we show that the reconstruction error of DUET is up to 66% smaller than with Sim CLR (rotation (4-fold)) and up to 70% smaller than with ESSL (grayscale). DUET: 2D Structured and Approximately Equivariant Representations Figure 2. Empirical validation of equivariance in DUET. We measure the L2 distance between the representations of a transformed image f(τg(x)) and the transformed representations of that image Tg(f(x)), varying g [0, 1] along both axes. The plots show the average L2 distance for 1000 CIFAR-10 test images. Note the strong similarity for the same group element (diagonal), and the cyclic nature of rotations or flips when using a v M target (top row), as opposed to the Gaussian target (bottom row). More results in Appendix E. Figure 3. Equivariance in DUET. We encode a test image (leftmost images), transform its representation using Tg (Equation (6)) for several g, and then decode the transformed representations. See how transforming the representations exposes the input transformation learnt by the model, empirically proving equivariance. (a) MNIST with Rot. (360) (left) and horizontal flip (right). (b) CIFAR-10 for Rot. (360) and color transformations. 5.3. DUET Representations for Classification 5.3.1. RRC+1 EXPERIMENTS In this section we analyze how DUET representations perform for discriminative tasks. Following the procedure in the ESSL work, where a single transformation is applied on top of Random Resized Crop (RRC), we carry out the set of RRC+1 experiments. We compare our method with Sim CLR and ESSL2. We also compare with a variant of our method (coined DUETλ=0) optimized without the group loss, that is with λ = 0 in Equation (3). Notice that in DUETλ=0 we still reshape the features to 2d and sum over the columns to obtain the content representation (that is con- 2ESSL representations are implicitly equivariant but do not guarantee interpretable structure with respect to the transformation. trasted), which is a fundamental difference with Sim CLR. DUETλ=0 learns unsupervised invariances very similarly to what NPTN (Pal & Savvides, 2018) does, but does not guarantee equivariance. For RRC+1, DUET uses λ = 10, except for rotations and vertical flip for which we use λ = 1000 according to the empirical study in Appendix H.1. The remaining parameters are set to σ = 0.2 and G = 8. The full training procedure is provided in Appendix H. In Figure 4 we show the accuracy of a linear tracking head for the RRC+1 experiment on CIFAR-10. The horizontal dashed line shows the baseline performance of Sim CLR with only RRC. For all considered transformations, we show results from training with a N target group-marginal. For cyclic transformations (rotations and flips), we further specialize the target and consider a (periodic) v M distribution instead, reporting results for this case also. It is important to point out that, for DUET, the tracking head receives our 2d representation flattened, and as such is of the same dimensionality as in compared methods. 
Our method outperforms Sim CLR for all transformations, and even improves over Sim CLR with RRC only by learning structure with respect to scale. Note that, by construction, ESSL cannot improve over the RRC-only baseline. A prominent result is the performance of DUET with color transformations. For the discrete transformation grayscale, ESSL degrades performance by 12.8% with respect to Sim CLR with grayscale, while DUET improves it by 1.75%. For continuous color transformations, DUET improves over Sim CLR between 3-5%, while ESSL degrades the performance by up to 4.3% (brightness). This shows that the implicit equivariance in ESSL is not sufficient in this case. We also observe in Figure 4 that ESSL does not improve DUET: 2D Structured and Approximately Equivariant Representations Figure 4. Test top-1 performance of a linear tracking head on CIFAR-10. It can be observed that DUET improves over Sim CLR and ESSL for all transformations. Notably for continuous color transformations, ESSL significantly degrades performance unlike DUET. over Sim CLR for horizontal flips, while DUET (v M) improves by 2.8%. In general, for ambiguous transformations like horizontal flip (see Section 6), we find that learning unsupervised structure with DUETλ=0 is beneficial. We also see that structure (and equivariance) is strongly helpful for vertical flips. For the more complex cyclic transformations, DUETλ=0 underperforms by a large margin, since such complex structure is harder to learn in a completely unsupervised way. This result shows that accounting for the topological structure of the transformation (as studied by Falorsi et al. (2018)) is of great importance, and opens the door to further research in this direction. Surprisingly, DUETλ=0 outperforms Sim CLR. We speculate that the unsupervised structure learnt by DUETλ=0 might induce a more discriminative organization of the embedding space. Table 2 benchmarks the more complex tasks CIFAR-100 (Krizhevsky, 2009) and Tiny Image Net (Li et al., 2017). We report the average across the cyclic groups and the colorrelated groups for better readability. DUET achieves the highest accuracy compared to all algorithms tested, including DUETλ=0, and across all groups but horizontal flip. DUET also improves under color transformations with respect to Sim CLR with the same transformations, while ESSL shows a degradation. Indeed, the datasets used in Table 2 present higher data scarcity per class than CIFAR10. In such setting, the structure learnt by DUET shines over unstructured methods like ESSL. 5.3.2. FULL AUGMENTATION STACK EXPERIMENTS In this section we use the full augmentation stack as in Sim CLR (see details in Appendix G). We learn structure for one group at a time, while applying the full stack on input images. Note that in the full stack setting, we use a fixed λ = 10 for DUET3. We observed in this case that extremely large λ can harm performance since multiple transformations add ambiguity to the group being learnt. 3We did not perform an extensive hyper-parameter tuning, the focus of this work being an exploration of structured representations in MSSL. Table 3 reports the test top-1 accuracy on CIFAR-10, CIFAR100 and Tiny Imagenet. One interesting observation is that DUET becomes better than the compared methods as the dataset complexity increases, achieving the best average accuracy for all sets of transformations on Tiny Image Net. 
For smaller and simpler datasets like CIFAR-10, DUET outperforms ESSL for color transformations, but ESSL is better for cyclic transformations. Still, DUET outperforms the Sim CLR baseline for cyclic transformations. Interestingly, neither DUET nor ESSL outperform Sim CLR by becoming equivariant to horizontal flips, as discussed in Section 6. Nevertheless, DUET still outperforms ESSL for horizontal flips by 0.86%, 4.4% and 4.18% on CIFAR-10, CIFAR-100 and Tiny Image Net respectively. Similarly, as the dataset complexity increases, DUET performs better than ESSL for vertical flips. Another interesting result is the effectiveness of ESSL with rotations, where DUET remains subpar but better than the Sim CLR baseline. The RRC column shows that DUET, by just learning structure to scale (approximately, as explained in Section 3.2), can improve accuracy using the vanilla Sim CLR augmentation stack. It is surprising how well DUETλ=0 performs in the full stack setting, surpassing DUET for simpler datasets. Indeed, DUETλ=0 learns an unsupervised structure, thus accounting for the interdependencies between the transformations applied. However, as observations of the transformation of interest are scarcer (e.g., more complex datasets or less data per class) optimizing for a known structure is beneficial. 5.4. Transfer to Other Datasets DUET s structure to rotations yields a gain of +21% with respect to Sim CLR when transferring to Caltech101 (Li et al., 2022), and between +5.97% and +16.97% when transferring to other datasets like CIFAR-10, CIFAR-100, DTD (Cimpoi et al., 2014) or Oxford Pets (Parkhi et al., 2012). Structure to color transformations also proves beneficial, with a +6.36% gain on Flowers (Tung, 2020) (grayscale), Food101 (Bossard et al., 2014) (hue) and +7.13% on CIFAR-100 (hue). Horizontal flip is the transformation that sees less gain due to its ambiguity, as discussed in Section 6. DUET: 2D Structured and Approximately Equivariant Representations Table 2. RRC+1 results: Accuracy of a linear tracking head on CIFAR-100 and Tiny Image Net. We also show the average over cyclic (v M target) and non-cyclic (N target) transformations. DUET improves over Sim CLR for all groups, while ESSL worsens performance for color transformations. We report the meanstd over 3 runs. Dataset Method RRC Rot. (360) Rot. (4-fold) H. Flip V. Flip Avg. Grayscale Brightness Contrast Saturation Hue Avg. Sim CLR 38.890.26 32.170.14 35.520.33 39.850.13 36.680.27 36.060.18 47.410.40 45.000.26 44.510.04 42.270.70 43.610.19 44.560.26 ESSL - 38.030.36 44.360.72 38.780.54 42.170.87 40.840.51 34.200.60 38.330.25 38.660.02 38.920.08 37.640.39 37.550.22 DUETλ=0 42.630.11 34.770.72 38.280.19 43.540.67 40.100.82 39.170.49 48.870.18 48.120.19 47.740.57 45.390.34 46.320.61 47.290.31 DUET 45.250.10 42.170.42 47.250.26 41.820.46 45.380.73 44.160.38 50.910.49 50.180.45 49.770.46 48.750.22 48.540.78 49.630.39 Tiny Image Net Sim CLR 26.910.13 21.340.08 24.400.24 27.900.37 26.740.30 25.090.20 31.351.29 29.950.14 29.680.18 28.600.29 28.200.40 29.550.38 ESSL - 25.350.10 30.410.25 27.130.14 29.110.27 28.000.16 23.510.18 26.450.07 26.320.61 26.750.72 26.000.11 25.800.28 DUETλ=0 29.570.47 24.220.30 26.960.37 30.780.23 28.980.42 27.730.27 31.550.14 32.540.18 32.230.17 31.430.24 30.450.60 31.640.22 DUET 31.260.21 27.780.24 31.550.28 30.340.59 31.710.53 30.340.33 34.920.24 33.960.16 34.200.34 33.420.35 32.640.03 33.830.18 Table 3. Full Stack results. 
We show the average accuracy of a linear tracking head over cyclic (v M target) and non-cyclic (N target) transformations. As the task complexity increases, DUET achieves better accuracy than the compared methods. Columns Rot. (360), Rot. (4-fold) and V. Flip require an additional transformation. We report the meanstd over 3 runs. Dataset Method RRC Rot. (360) Rot. (4-fold) H. Flip V. Flip Avg. Grayscale Brightness Contrast Saturation Hue Avg. Sim CLR 87.420.01 79.900.50 81.500.44 87.480.06 82.780.23 82.920.22 87.410.03 87.510.11 87.570.19 87.490.08 87.670.34 87.530.11 ESSL - 86.550.13 89.330.32 84.780.40 86.660.21 86.830.22 83.590.43 85.780.25 86.310.16 87.120.30 86.390.44 85.840.26 DUETλ=0 87.500.20 79.050.37 81.320.18 87.730.19 82.660.14 82.690.22 87.690.17 87.470.13 87.340.20 87.540.33 87.630.20 87.530.20 DUET 87.220.10 81.700.30 83.490.16 85.640.08 83.840.22 83.670.15 87.400.08 86.970.22 87.050.37 87.520.19 87.970.11 87.380.16 Sim CLR 61.400.17 56.400.30 57.320.03 61.480.28 56.730.48 57.980.19 61.430.21 61.310.05 61.680.57 61.570.41 61.300.03 61.460.18 ESSL - 58.320.06 63.280.28 55.220.29 57.180.18 58.500.17 55.100.47 57.920.38 58.060.58 60.250.34 58.910.30 58.050.34 DUETλ=0 62.130.13 55.490.22 57.790.40 62.250.34 56.880.18 58.100.29 62.320.26 62.390.27 62.470.16 62.540.20 62.290.29 62.400.24 DUET 62.170.28 55.660.39 58.010.31 59.620.11 57.400.15 57.670.20 62.180.51 62.240.31 61.900.72 62.670.19 63.310.21 62.460.32 Tiny Image Net Sim CLR 42.160.16 37.350.19 39.230.15 42.310.06 39.350.09 38.500.09 42.110.23 42.320.08 42.340.10 42.270.01 42.460.27 42.300.10 ESSL - 37.530.21 42.860.29 36.250.13 37.180.77 38.460.29 35.500.30 37.940.13 38.660.50 40.550.74 40.490.05 38.630.28 DUETλ=0 43.070.11 36.280.95 39.540.41 42.430.33 38.870.40 39.280.52 42.790.16 42.610.34 42.980.08 42.860.30 42.900.46 42.830.27 DUET 43.560.54 38.060.06 40.030.28 40.430.21 39.360.50 39.470.18 42.551.27 43.410.01 43.710.07 44.130.57 44.610.10 43.680.29 6. Discussion and Limitations On the Dimensionality of DUET Representations. We reshape the output of the backbone (RD) to z RC G. The final representation used for downstream tasks is a flattened (RD) version of z. For a fair comparison, Sim CLR and ESSL also yield RD representations. Trading off Structure and Expressivity. By increasing G we reduce the effective dimensionality of the content representations (RC) contrasted through LC. This implies a trade-off between structure (improves generation, transferrability) and expressivity (improves discrimination). Such effect is visible in the transfer learning results, where learning structure to rotation is not useful when transferring to the Flowers dataset. Indeed, such dataset contains many circular flowers, which are rotation (and flip) invariant. Transformation Ambiguity. A dataset containing examples related by input transformations results in transformation ambiguity, and the distribution over group actions P(g|xk) becomes multi-modal. This is shown in Figure 5 where the weight of each mode corresponds to the observed probability in the dataset, i.e., P(g|xk) reflects the bias of the dataset with respect to the transformation. Additional results in Appendix J show ambiguity also for color trans- Figure 5. Observed P(g|x) for horizontal (left) and vertical (right) flips, obtained from 1000 CIFAR-10 images. Note the inherent ambiguity for horizontal flips. Also, see that the modes of the distributions correspond to the mapped points specified in Table 1. 
formations, e.g., natural images may present a different default hue, yielding a spread-out P(g|xk). This phenomenon is also observed in Section 5.3.2 and Section 5.4, where the notion of a left-flipped image is ambiguous, whereas a vertically flipped image is not, and only the latter transformation yielded a performance gap between equivariant and invariant methods.

Are c and g Dependent? To better understand the dependency between c and g (conditioned on xk) quantitatively, we measure the difference ∆P = ‖P(c, g|xk) − P(c|xk)P(g|xk)‖²₂. In Table 4 we report the average difference for the DUET representations of 100 images from CIFAR-10, for 100 images with independent and identically distributed (iid) pixels, and for 100 random representations (iid features). Note that this difference is expected to be 0 for the random representations (independent) and close to 0 for the iid pixels (no symmetries in the data).

Table 4. Dependence of c and g conditioned on xk. The learnt marginal representations for content (c) and group element (g) are dependent. This is a core strength of DUET, where group structure and content are not assumed independent, but rather have their dependencies learnt from data.

| Representations | ∆P |
|---|---|
| DUET w/ CIFAR-10 | 178.17 |
| DUET w/ iid pixels | 0.015 |
| iid representations | 0.00075 |

Computational Requirements. The training times of SimCLR and DUET are practically the same. DUET's extra operations introduce a negligible overhead, namely a sum over the rows and columns of z and the computation of the Jensen-Shannon divergence in LG. Interestingly, the projection head h in DUET is smaller than in SimCLR, since the content features are of lower dimension, effectively reducing the number of model parameters with respect to SimCLR. Compared to ESSL, DUET shows an important computational gain. Indeed, the time required to train ESSL depends on the group chosen. Taking the implementation in (Dangovski et al., 2022) for 4-fold rotations, the backbone consumes 2 + 4 versions of each image, resulting in an overall training time 2.01× longer than that of DUET. For other transformations, ESSL consumes 2 + 2 images (e.g., flips) or 2 + 1 (e.g., contrast), thus resulting in a longer training time than DUET in all cases.

7. Conclusion

We introduce DUET, a method to learn structured and equivariant representations using MSSL. DUET uses 2d representations that model the joint distribution between input content and the group element acting on the input. DUET representations, optimized through the content and group element marginal distributions, become structured and equivariant to the group elements. We design an explicit form of transformation at representation level that allows exploiting equivariance for controlled generation. Our results show that DUET representations are expressive for generative purposes (lower reconstruction error) as well as for discriminative purposes. Overall, this work shows that accounting for the topological structure of input transformations is of great importance to improve generalization in MSSL.

References

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In NeurIPS, volume 32, 2019.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 mining discriminative components with random forests. In ECCV, 2014.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments.
Neur IPS, 2020. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. ICML, 2020. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., , and Vedaldi, A. Describing textures in the wild. In CVPR, 2014. Cohen, T. and Welling, M. Group equivariant convolutional networks. In ICML, pp. 2990 2999. PMLR, 2016. Cotogni, M. and Cusano, C. Offset equivariant networks and their applications. Neurocomputing, 502:110 119, 2022. Dangovski, R., Jing, L., Loh, C., Han, S., Srivastava, A., Cheung, B., Agrawal, P., and Soljaˇci c, M. Equivariant contrastive learning. ICLR, 2022. Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forr e, P., and Cohen, T. S. Explorations in homeomorphic variational auto-encoding. ar Xiv preprint ar Xiv:1807.04689, 2018. Grill, J.-B., Strub, F., Altch e, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., Piot, B., Kavukcuoglu, K., Munos, R., and Valko, M. Bootstrap your own latent: A new approach to self-supervised learning. Neur IPS, 2020. Gutmann, M. and Hyv arinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297 304, 2010. He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In CVPR, 2016. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. CVPR, 2019. DUET: 2D Structured and Approximately Equivariant Representations Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. ICLR, 2019. Huang, C., Goh, H., Gu, J., and Susskind, J. Mast: Masked augmentation subspace training for generalizable selfsupervised priors. In ICLR, pp. 297 304, 2023. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pp. 448 456, 2015. Jiao, J. and Henriques, J. F. Quantised transforming autoencoders: Achieving equivariance to arbitrary transformations in deep networks. In BMVC, 2021. Keller, T. A. and Welling, M. Topographic VAEs learn equivariant capsules. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. Keller, T. A., Suau, X., and Zappella, L. Homomorphic self-supervised learning. Neur IPS SSL Workshop, 2022. Kim, S., Kim, S., and Lee, J. Hybrid generative-contrastive representation learning. ICLR, 2021. Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In ICLR, 2014. Krizhevsky, A. Learning multiple layers of features from tiny images. pp. 32 33, 2009. Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018. Laptev, D., Savinov, N., Buhmann, J. M., and Pollefeys, M. Ti-pooling: transformation-invariant pooling for feature learning in convolutional neural networks, 2016. Lee, H., Lee, K., Lee, K., Lee, H., and Shin, J. Improving transferability of representations via augmentation-aware self-supervision. Neur IPS, 2021. 
Li, F.-F., Karpathy, A., and Johnson, J. cs231n course at stanford university, 2017. URL https://www.kaggle. com/c/tiny-imagenet. Li, F.-F., Andreeto, M., Ranzato, M., and Perona, P. Caltech 101, 2022. Li, T., Fan, L., Yuan, Y., He, H., Tian, Y., Feris, R., Indyk, P., and Katabi, D. Addressing feature suppression in unsupervised visual representations. ar Xiv preprint ar Xiv:2012.09962, 2020. Linsker, R. Self-organization in a perceptual network. Computer, 21(3):105 117, 1988. L owe, S., O Connor, P., and Veeling, B. Putting an end to end-to-end: Gradient-isolated learning of representations. In Neur IPS, 2019. Mac Donald, L. E., Ramasinghe, S., and Lucey, S. Enabling equivariance for arbitrary lie groups. pp. 8183 8192, 2022. Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. Neur IPS, 2018. Pal, D. K. and Savvides, M. Non-parametric transformation networks, 2018. Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. V. Cats and dogs. In CVPR, 2012. Serre, J.-P. Linear representations of finite groups., volume 42 of Graduate texts in mathematics. Springer, 1977. Sosnovik, I., Szmaja, M., and Smeulders, A. Scaleequivariant steerable networks. In International Conference on Learning Representations, 2020. St uhmer, J., Turner, R., and Nowozin, S. Independent subspace analysis for unsupervised learning of disentangled representations. In AISTATS, 2020. Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., and Isola, P. What makes for good views for contrastive learning? Neur IPS, 2020. Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning, 2019. Tung, K. Flowers Dataset, 2020. URL https://doi. org/10.7910/DVN/1ECTVN. Wang, T. and Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, volume 119, pp. 9929 9939. PMLR, 2020. Wu, Y. and He, K. Group normalization. In ECCV, pp. 3 19, 2018. Zbontar, J., Jing, L., Misra, I., Le Cun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. ICML, 2021. DUET: 2D Structured and Approximately Equivariant Representations Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In Meila, M. and Zhang, T. (eds.), ICML, volume 139 of Proceedings of Machine Learning Research, pp. 12979 12990. PMLR, 2021. DUET: 2D Structured and Approximately Equivariant Representations A. Bounding the Equivariance Error To introduce some notation, let us assume that, for an input data point x0, the training procedure has seen the augmentations xi = τgi(x0), generating the respective representations zi = f(xi) in feature space. Notice that, with some abuse of notation , in this scenario we consider z to be the column reduction of the feature space, since that is the only part dedicated to guaranteeing equivariance. Since we are at the optimum, these must produce group marginal distributions Qj(zi) ˆQgi, where ˆQgi represents the discretization of the target distribution with mean gi. At the end of training, if exact equivariance is reached (i.e. if LG is minimized), a newly generated augmentation x = τg(x0) for a given transformation parameter g, would be mapped by our neural network to the feature vector z, such that Q(z) ˆQg. Since this augmentation was not seen during training time, however, this is not guaranteed. 
We are interested in providing a bound on the error between the representation actually recovered, and the ideal one, which gives us an indication of how much our neural network can violate equivariance for unseen transformation parameters g. This is given by the following theorem. Theorem A.1. For a training point x0, at the optimum of LG = 0, the equivariance error of a neural network f trained with loss Equation (2) is bounded by f(τg(x0)) Tg(f(x0)) (Lf Lτg + LTg) min i |g gi|, (7) where LTg, Lτg and Lf are the Lipschitz constants associated with the transformations Tg, τg, and the network f, respectively. Proof. Using triangular inequality, we get f(τg(x0)) Tg(f(x0)) f(τg(x0)) f(τgi(x0)) + f(τgi(x0)) Tg(f(x0)) (8) for any given augmentation xi = τgi(x0) seen during training time. At the optimum, we have by construction that f(τgi(x0)) = Tgi(f(x0)), which allows us to rewrite the second term as f(τgi(x0)) Tg(f(x0)) = Tgi(f(x0)) Tg(f(x0)) LTg|g gi| (9) Notice LTg depends on the target discretization chosen: for the Gaussian target and -norm, we recover it analytically in Lemma A.3. The first term, instead, becomes f(τg(x0)) f(τgi(x0)) Lf τg(x0) τgi(x0) Lf Lτg|g gi|. (10) Combining these results together, we recover the target bound. Notice that for a discrete group, instead, it is possible to train f(x) so that it achieves exact equivariance: Corollary A.2. Given a discrete group G, a neural network f(x) trained with loss Equation (2) achieves equivariance at the optimum, if it is exposed to all group transformations. Proof. The proof follows directly from Theorem A.1 by noticing that g gi = 0 necessarily, if all group transformations have been seen during training time. Theorem A.3. For a Gaussian target, ˆQi(g) = Ωi N(g,σ)( g) d g R [0,1] N(g,σ)( g) d g, the Lipschitz continuity constant for Tg in -norm is given by ˆµ G 1(0), with ˆµj defined in Equation (5). Proof. Starting from the definition of Tg(z) in Equation (6), and using the mean-value theorem, we get Tg(z) Tˆg(z) = ˆ Mg ˆ Mˆg = max j |ˆµj(g) ˆµj(ˆg)| = max j |ˆµ j( gj)||g ˆg| (11) for some (possibly different for different j) gj [g, ˆg]. We remind that ˆµj(g) is defined in equation 5 as ˆµj(g) = ln ˆQj(g) 1 i ln ˆQi(g) = ln i ln Φi+1 i (g) ΦG 0 (g) = ln Φj+1 j (g) 1 i ln Φi+1 i (g), with Φj i = Z gj gi N(g, σ)(x) dx, and gi = i DUET: 2D Structured and Approximately Equivariant Representations so that its derivative can be compactly written as ˆµ j(g) = hj(g) 1 i hi(g), where hi(g) = ( Φi+1 i ) (g) Φi+1 i (g) . (13) Our goal is to bound ˆµ j(g), which can be quantified starting from considerations on the various hj(g). It can be proven that these are: equivalent modulo translations: hj(g) = hj i(g i/G); antisymmetric with respect to g around the centerpoint g j = (gj+1 + gj)/2: hj(g j + g) = hj(g j g); antisymmetric with respect to j: hj(g j + g) = h G j(g G j g); decreasing: h j(g) 0; convex for g < g j : g g j = h j (g) 0. We can gain a better intuition about how to effectively bound ˆµ (g) by rewriting equation 13 using the equivalence under translations of hj(g): ˆµ j(g) = 1 this shows that for each j we are averaging the differences between hj(g) and the same function evaluated at G equispaced points g (i j)/G. Since hj(g) is decreasing, we deduce that this difference is positive whenever i < j, and negative otherwise. We have then that the maximum absolute value of ˆµ j(g) is always attained for the most extreme j, since that guarantees that the largest number of terms share the same sign. 
Without loss of generality (by symmetry), we can consider j = G 1, and we have max j |ˆµ j(g)| = ˆµ G 1(g) g [0, 1]. (15) It suffices now to bound this quantity in [0, 1]. Due to the concavity of hj(g), its maxima will be at the boundary, and specifically at g = 0. This can be shown by simply comparing the values at 0 and at 1 (we drop the subscript G 1 and consider h G 1(g) = h(g) from now on): ˆµ G 1(0) ˆµ G 1(1) = 1 h(0) h G 1 i h(1) h 1 + G 1 i h(0) + h G 1 h(0) + h G 1 where we exploited the antisymmetry of h G 1(g) around g G 1 = 1 1/(2G) to aptly change the inner arguments, as well as the convexity of h(g) for g < g G 1 to state that h(0) + h G 1 G + h G 1 i G , for each i. This allows us to explicitly write the Lipschitz constant for the transformation Tg(z) as Tg(z) Tˆg(z) ˆµ G 1(0)|g ˆg|. (17) B. Proof of Axioms for Tg in Equation (6) Notice that Tg, thus defined, satisfies the group axioms at proper training (LG is minimized). In fact: Neutral: g = 0 s.t. T0(z) = z. Easily proven since Mg0 = c Mg0+0. DUET: 2D Structured and Approximately Equivariant Representations Inverse: g 1 = g s.t. Tg 1 Tg(z) = z. Let z = Tg(z), then Tg 1(z ) = z Mg 0 + c Mg 0 g. Since g 0 = g0 +g, then Tg 1 Tg(z) = z Mg0 + c Mg0+g Mg0+g + c Mg0+g g = z. Associativity: Similar reasoning as for the inverse property with 2 different elements. Closure: We work in RD at representation level, so closure is verified. DUET: 2D Structured and Approximately Equivariant Representations C. DUET for Multiple Groups Modern MSSL frameworks use complex augmentation stacks that compose several transformations. While learning structure with respect to a single group is interesting, one could benefit from learning such structure for a set of groups. For readability, we focus on the two group case (GA and GB); but the following reasoning can easily be extended to more groups. In order to model the interdependencies between groups and content, one can learn the joint distribution P(c, g A, g B|xk), where g A GA and g B GB are 2 random variables representing the respective group elements. Such approach implies that our backbone f maps to RC |GA| |GB|. The marginal distributions are now obtained by summing over the non-desired dimensions (e.g., over C and GA to obtain P(g B|xk)). Using these new marginals, we define the multi-group loss as LMulti-G = 1 l={A,B} LGA + LGB. (18) However, as the number of groups increases, modelling the joint distribution becomes intractable. In practice, keeping C constant, the dimensionality of z increases in O(Gn) with the number of groups. To address scalability, we propose to relax the formulation and let our backbone f map into RC (|GA|+|GB|), so that the dimensionality of z increases in O(n G) with the number of groups. Using this relaxation we actually consider GA and GB independent, although their structure is learnt jointly during training. In practice, z is divided into two blocks, with |GA| and |GB| columns each. In this scenario, P(g A|xk) is obtained by summing over columns of the GA block, and P(g B|xk) by summing over the columns of the GB block. The content marginal P(c|xk) is obtained by concatenating the sum over the rows of each group block. D. Recovering the Transformation Parameter for a Von-Mises Distribution Let xi be samples of a v M(x|µ, κ) with unknown parameters µ and κ. We want to recover the parameter µ, which corresponds to the group element that yields such v M prior. Let r = P i xi be the baricenter of the samples with respect to the origin, then g = µ = angle(r). E. 
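A minimal sketch of this von Mises mean recovery follows. It is our adaptation of Appendix D to the setting of Section 4.4: instead of raw samples, the bin centers gj of the discretized group marginal are weighted by their probabilities Pj before taking the barycenter.

```python
import numpy as np

def recover_vm_mean(p: np.ndarray, G: int = 8) -> float:
    """Recover the group element g (the vM mean) from a discretized group
    marginal P(g|x) over G bins of [0, 1]. The barycenter r is computed as a
    probability-weighted sum of unit vectors, and g = angle(r) mapped back
    to [0, 1)."""
    centers = (np.arange(G) + 0.5) / G                # bin centers g_j in [0, 1]
    angles = 2.0 * np.pi * centers                    # map the cyclic domain to [0, 2*pi)
    r = np.sum(p * np.exp(1j * angles))               # barycenter with respect to the origin
    return float(np.angle(r) % (2.0 * np.pi)) / (2.0 * np.pi)
```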
Empirical Equivariance: Additional Plots Figure 6. Empirical validation of equivariance for cyclic groups with a non-cyclic Gaussian prior. Note the difference with the top row of Figure 2. In the Gaussian case, the cyclic nature of rotation and flip is not observed, and equivariance is less well satisfied. DUET: 2D Structured and Approximately Equivariant Representations Figure 7. Examples of the transformations τg applied on the input images x to obtain the plots in Figure 2 and Figure 6. For flips, we simulate a gradual flip by alpha-blending the 2 flipped images. F. Reconstruction Error In order to verify our hypothesis that structured representations are beneficial for generation, we measure the reconstruction error obtained with the decoders used in Section 5.2. We use a mean squared error loss for reconstruction: L(i) rec = d(f(τg(x(i)))) τg(x(i)) 2 2, where d( ) is a decoder network. In Figure 8 we plot the final test Lrec on CIFAR-10 for decoders trained on frozen DUET, ESSL and Sim CLR representations, for some of the transformations analyzed. The obtained reconstruction error with DUET is up to 66% smaller than with Sim CLR (rotation (4-fold)) and up to 70% smaller than with ESSL (grayscale). Figure 8. Reconstruction error (smaller is better) obtained with decoders trained on frozen DUET, ESSL and Sim CLR representations. The horizontal dashed line shows the baseline error of Sim CLR with only RRC. G. Full Stack Augmentations In Section 5.3.2 we report the performance of DUET and other methods using the full Sim CLR augmentation stack. More precisely, the augmentations used are: Random Resized Crop(scale=(0.2, 1.0)) Color Jitter(brightness=0.4, saturation=0.4, contrast=0.4, hue=0.1, p=0.8) Random Horizontal Flip(p=0.5) Random Grayscale(p=0.2) Random Gaussian Blur(kernel size=(3, 3), sigma=(0.1, 2.0), p=0.5) DUET: 2D Structured and Approximately Equivariant Representations When we learn structure for groups that are not directly parameterized in this stack, we add a specific transformation. For example, for the vertical flip group we add Random Vertical Flip(p=0.5). Or for rotations, we add a random rotation transformation in the stack. H. Training Procedure For all our experiments we use as backbone a Res Net-32 (He et al., 2016) architecture with an input kernel of 3 3 and stride of 1. The output dimensionality of the Res Net is R512, which we reshape to R64 8 for a group granularity of G = 8. Note that we do not add parameters, we only reshape the output of a vanilla Res Net to build our DUET representations. Additional training parameters are shown in Table 5. The detached decoders in Section 5.2 are also trained using the same procedure. The reconstructed images are RGB with 32 32 pixels. The decoder architecture is a Res Net-18 with swish activation functions, visual attention and Group Norm (Wu & He, 2018) normalization. Table 5. Training parameters. Batch size 2048 Epochs 800 Input images RGB of 32 32 Learning rate 0.0001 Learning rate warm-up 10 epochs Learning rate schedule Cosine Optimizer Adam(β = [0.9, 0.95]) Weight decay 0.0001 H.1. Effect of λ, σ and G We perform a sweeping of λ values between 0 and 1000. The first observation is that adding structure improves over Sim CLR for all transformations (see Figure 10 in the Appendix). However, color transformations and horizontal flips degrade performance if we strongly impose structure. 
This result hints that structure for such groups is harder to learn, or is less learnable from data (e.g., the structure is ambiguous, as in the case of having flipped and non-flipped images in the dataset). Interestingly, DUETλ=0 also improves slightly over Sim CLR, showing that unsupervised structure is still helpful for the specific case of CIFAR-10. Overall, our results show that λ = 10 is optimal for all transformations but scale, rotations and vertical flips which can handle up to λ = 1000. In Figure 9 we show the accuracy of Sim CLR and DUET at different σ for all transformations analyzed. The violin plots show the median accuracy across transformations. We obtain an empirically optimal value of σ = 0.2 for DUET. Note that σ = 10 is almost equivalent to a uniform target, thus not imposing any structure. In Figure 11 we show the detailed results per transformation, observing that horizontal flip behaves better with a uniform target. Indeed, as observed in Section 5.1 and Section 5.3, with the datasets used horizontal flip is ambiguous and we cannot learn this symmetry from data. Lastly, we found DUET is quite insensitive to the choice of parameter G, based on results on CIFAR-10. We sweep G = 2, 4, 8, 16 and the obtained accuracy changes by less than 1%. We choose to use a reasonable value of G = 8. DUET: 2D Structured and Approximately Equivariant Representations Figure 9. Sweep of DUET s parameter σ. We find empirically that σ 0.2 works best. Figure 10. Sweep of DUET s parameter λ. Note that different transformations require different optimal λs. Figure 11. Test top-1 performance on CIFAR-10 as we modify the σ parameter in DUET. We report here the results per group. DUET: 2D Structured and Approximately Equivariant Representations I. Additional Results for Transfer Learning These results complement those summarized in Section 5.4. We train a logistic regression classifier on the representations of the training split of each dataset. No augmentations are applied during the classifier training. At test time, we evaluate the classifier on the test set of each dataset. In Figure 12 we report the difference in accuracies between DUET and Sim CLR. DUET s structure to rotations yields a gain of +21% when transferring to Caltech101, and very important gains when transferring to other datasets like CIFAR-10, CIFAR-100, DTD or Pets. Structure to color transformations also proves beneficial, with a +6.36% gain on Flowers (grayscale), 7.05% on Food101 (Bossard et al., 2014) (hue) and 7.13% on CIFAR-100 (hue). Horizontal flip is the transformation that sees less gain, as expected given its ambiguity as shown in Figure 5. It is interesting to see that DUET achieves slightly worse performance for rotations or flips on the Flowers dataset. Indeed, this dataset contains many circular flowers, which are rotation (or flip) invariant. In such situation, learning structure for rotations (or flips) should not give any gain. Actually, in DUET we are trading off content for structure, so if the structure learnt is not useful, we are actually diminishing the expressivity of the final representations. Comparing with ESSL Figure 13, DUET achieves better transfer results for most of the datasets and transformations. Interestingly, ESSL improves over DUET for geometric transformations on Flowers, due to the trade-off inherent in DUET (see Section 6). For completeness, the results of ESSL compared to Sim CLR are shown in Figure 14. 
The linear regression for ESSL with grayscale on CIFAR-100 did not converge, thus we removed that result from the plots. Figure 12. Difference in accuracies between DUET and Sim CLR, when transferring representations learnt on Tiny Image Net to different datasets in the RRC+1 setting. DUET: 2D Structured and Approximately Equivariant Representations Figure 13. Difference in transfer accuracies between DUET and ESSL, when transferring representations learnt on Tiny Image Net to different datasets in the RRC+1 setting. Figure 14. Difference in transfer accuracies between ESSL and Sim CLR, when transferring representations learnt on Tiny Image Net to different datasets in the RRC+1 setting. DUET: 2D Structured and Approximately Equivariant Representations J. Additional Results about Transformation Ambiguity Figure 15. Observed P(g|x) for different transformations, obtained from 100 randomly sampled CIFAR-10 images. Note the inherent ambiguity for color transformations, in addition to the one observed for horizontal flips in Figure 5.(left). Also, see how the modes of the distributions correspond to the mapped points in Table 1.