# Differentiable Random Partition Models

Thomas M. Sutter, Alain Ryser, Joram Liebeskind, Julia E. Vogt
Department of Computer Science, ETH Zurich
Equal contribution. Correspondence to {thomas.sutter,alain.ryser}@inf.ethz.ch
37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Abstract

Partitioning a set of elements into an unknown number of mutually exclusive subsets is essential in many machine learning problems. However, assigning elements, such as samples in a dataset or neurons in a network layer, to an unknown and discrete number of subsets is inherently non-differentiable, prohibiting end-to-end gradient-based optimization of parameters. We overcome this limitation by proposing a novel two-step method for inferring partitions, which allows its usage in variational inference tasks. This new approach enables reparameterized gradients with respect to the parameters of the new random partition model. Our method works by first inferring the number of elements per subset and, second, by filling these subsets in a learned order. We highlight the versatility of our general-purpose approach on three different challenging experiments: variational clustering, inference of shared and independent generative factors under weak supervision, and multitask learning.

1 Introduction

Partitioning a set of elements into subsets is a classical mathematical problem that attracted much interest over the last few decades (Rota, 1964; Graham et al., 1989). A partition over a given set is a collection of non-overlapping subsets such that their union results in the original set. In machine learning (ML), partitioning a set of elements into different subsets is essential for many applications, such as clustering (Bishop and Svensen, 2004) or classification (De la Cruz-Mesía et al., 2007). Random partition models (RPM, Hartigan, 1990) define a probability distribution over the space of partitions. RPMs can explicitly leverage the relationship between elements of a set, as they do not necessarily assume i.i.d. set elements. On the other hand, most existing RPMs are intractable for large datasets (MacQueen, 1967; Plackett, 1975; Pitman, 1996) and lack a reparameterization scheme, prohibiting their direct use in gradient-based optimization frameworks.

In this work, we propose the differentiable random partition model (DRPM), a fully differentiable relaxation for RPMs that allows reparametrizable sampling. The DRPM follows a two-stage procedure: first, we model the number of elements per subset, and second, we learn an ordering of the elements, with which we fill the subsets. The DRPM enables the integration of partition models into state-of-the-art ML frameworks and learning RPMs from data using stochastic optimization.

We evaluate our approach in three experiments, demonstrating the proposed DRPM's versatility and advantages. First, we apply the DRPM to a variational clustering task, highlighting how the reparametrizable sampling of partitions allows us to learn a novel kind of Variational Autoencoder (VAE, Kingma and Welling, 2014). By leveraging potential dependencies between samples in a dataset, DRPM-based clustering overcomes the simplified i.i.d. assumption of previous works, which used categorical priors (Jiang et al., 2016). In our second experiment, we demonstrate how to retrieve sets of shared and independent generative factors of paired images using the proposed DRPM.
Figure 1: Illustration of the proposed DRPM method. We first sample a permutation matrix π and a set of subset sizes n separately in two stages. We then use n and π to generate the assignment matrix Y, the matrix representation of a partition ρ.

In contrast to previous works (Bouchacourt et al., 2018; Hosoya, 2018; Locatello et al., 2020), which rely on strong assumptions or heuristics, the DRPM enables end-to-end inference of generative factors. Finally, we perform multitask learning (MTL) by using the DRPM as a building block in a deterministic pipeline. We show how the DRPM learns to assign subsets of network neurons to specific tasks. The DRPM can infer the subset size per task based on its difficulty, overcoming the tedious work of finding optimal loss weights (Kurin et al., 2022; Xin et al., 2022).

To summarize, we introduce the DRPM, a novel differentiable and reparametrizable relaxation of RPMs. In extensive experiments, we demonstrate the versatility of the proposed method by applying the DRPM to clustering, inference of generative factors, and multitask learning.

2 Related Work

Random Partition Models  Previous works on RPMs include product partition models (Hartigan, 1990), species sampling models (Pitman, 1996), and model-based clustering approaches (Bishop and Svensen, 2004). Further, Lee and Sang (2022) investigate the balancedness of subset sizes of RPMs. All of these approaches require tedious manual adjustment, are non-differentiable, and are therefore unsuitable for modern ML pipelines. A fundamental RPM application is clustering, where the goal is to partition a given dataset into different subsets, the clusters. In contrast to many existing approaches (Yang et al., 2019; Sarfraz et al., 2019; Cai et al., 2022), we consider cluster assignments as random variables, allowing us to treat clustering from a variational perspective. Previous works in variational clustering (Jiang et al., 2016; Dilokthanakul et al., 2016; Manduchi et al., 2021) implicitly define RPMs to perform clustering. They compute partitions in a variational fashion by making i.i.d. assumptions about the samples in the dataset and imposing soft assignments of the clusters to data points during training. A problem related to set partitioning is the earth mover's distance problem (EMD, Monge, 1781; Rubner et al., 2000). However, the EMD aims to assign a set's elements to different subsets based on a cost function and given subset sizes. Iterative solutions to the problem exist (Sinkhorn, 1964), and various methods have recently been proposed, e.g., for document ranking (Adams and Zemel, 2011) or permutation learning (Santa Cruz et al., 2017; Mena et al., 2018; Cuturi et al., 2019).

Differentiable and Reparameterizable Discrete Distributions  Following the proposition of the Gumbel-Softmax trick (GST, Jang et al., 2016; Maddison et al., 2017), interest in continuous relaxations of discrete distributions and non-differentiable algorithms rose. The GST enabled the reparameterization of categorical distributions and their integration into gradient-based optimization pipelines. Based on the same trick, Sutter et al. (2023) propose a differentiable formulation for the multivariate hypergeometric distribution.
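Since the GST is the building block behind the reparameterizations discussed here, the following is a minimal, self-contained sketch of the trick using PyTorch's built-in `gumbel_softmax`; the logits, temperature, and toy loss are illustrative choices and not taken from any of the cited works.

```python
import torch
import torch.nn.functional as F

# Unnormalized log-probabilities (logits) of a 5-way categorical distribution.
logits = torch.tensor([1.0, 0.5, -0.2, 2.0, 0.1], requires_grad=True)

# Gumbel-Softmax / Concrete relaxation: a differentiable, approximately one-hot
# sample. Lower temperatures tau yield samples closer to discrete one-hot vectors.
soft_sample = F.gumbel_softmax(logits, tau=0.5, hard=False)

# Straight-through variant: discrete one-hot in the forward pass,
# gradients flow through the soft relaxation in the backward pass.
hard_sample = F.gumbel_softmax(logits, tau=0.5, hard=True)

# Because sampling is reparameterized, gradients w.r.t. the logits exist.
loss = (soft_sample * torch.arange(5.0)).sum()
loss.backward()
print(logits.grad)
```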
Multiple works on differentiable sorting procedures and permutation matrices have been proposed, e.g., Linderman et al. (2018); Prillo and Eisenschlos (2020); Petersen et al. (2021). Further, Grover et al. (2019) described the distribution over permutation matrices p(π) for a permutation matrix π using the Plackett-Luce distribution (PL, Luce, 1959; Plackett, 1975). Prillo and Eisenschlos (2020) proposed a computationally simpler variant of Grover et al. (2019). More examples of differentiable relaxations include the top-k element selection procedure (Xie and Ermon, 2019), black-box combinatorial solvers (Pogančić et al., 2019), implicit likelihood estimation (Niepert et al., 2021), and k-subset sampling (Ahmed et al., 2022).

3 Preliminaries

Set Partitions  A partition $\rho = (S_1, \dots, S_K)$ of a set $[n] = \{1, \dots, n\}$ with $n$ elements is a collection of $K$ subsets $S_k \subseteq [n]$, where $K$ is a priori unknown (Mansour and Schork, 2016). For a partition $\rho$ to be valid, it must hold that
$$S_1 \cup \dots \cup S_K = [n] \quad \text{and} \quad \forall k \neq l: S_k \cap S_l = \emptyset. \tag{1}$$
In other words, every element $i \in [n]$ has to be assigned to precisely one subset $S_k$. We denote the size of the $k$-th subset $S_k$ as $n_k = |S_k|$. Alternatively, we can describe a partition $\rho$ through an assignment matrix $Y = [y_1, \dots, y_K]^T \in \{0,1\}^{K \times n}$. Every row $y_k \in \{0,1\}^{1 \times n}$ is a multi-hot vector, where $y_{ki} = 1$ assigns element $i$ to subset $S_k$.

Within the scope of our work, we view a partition of a set of $n$ elements as a special case of the urn model. Here, the urn contains marbles of $n$ different colors, where each color corresponds to a subset in the partition. For each color, there are $n$ marbles corresponding to the potential elements of that color/subset. To derive a partition, we sample $n$ marbles without replacement from the urn and register the order in which we draw the colors. The color of the $i$-th marble then determines the subset to which element $i$ corresponds. Furthermore, we can constrain the partition to only $K$ subsets by taking an urn with only $K$ different colors.

Probability distribution over subset sizes  The multivariate non-central hypergeometric distribution (MVHG) describes sampling without replacement and allows skewing the importance of groups with an additional importance parameter $\omega$ (Fisher, 1935; Wallenius, 1963; Chesson, 1976). The MVHG is an urn model and is described by the number of different groups $K \in \mathbb{N}$, the number of elements in the urn of every group $m = [m_1, \dots, m_K] \in \mathbb{N}^K$, the total number of elements in the urn $\sum_{k=1}^{K} m_k \in \mathbb{N}$, the number of samples to draw from the urn $n \in \mathbb{N}_0$, and the importance factor for every group $\omega = [\omega_1, \dots, \omega_K] \in \mathbb{R}^K_{\geq 0}$ (Johnson, 1987). Then, the probability of sampling $n = [n_1, \dots, n_K]$, where $n_k$ describes the number of elements drawn from group $k$, is
$$p(n; \omega, m) = \frac{1}{P_0} \prod_{k=1}^{K} \binom{m_k}{n_k} \omega_k^{n_k}, \tag{2}$$
where $P_0$ is a normalization constant. Hence, the MVHG $p(n; \omega, m)$ allows us to model dependencies between different elements of a set since drawing one element from the urn influences the probability of drawing one of the remaining elements, creating interdependence between them. For the rest of the paper, we assume $\forall m_k \in m: m_k = n$. We thus use the shorthand $p(n; \omega)$ to denote the density of the MVHG. We refer to Appendix A.1 for more details.
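To make the urn intuition behind the MVHG concrete, here is a small, non-differentiable sketch of sequential sampling without replacement with importance weights ω (a Wallenius-style sampler). The function name and example values are illustrative; this is not the reparameterized MVHG relaxation of Sutter et al. (2023) that the DRPM builds on.

```python
import numpy as np

def sample_subset_sizes(m, omega, n_draws, rng=None):
    """Draw n_draws marbles without replacement from an urn with K colors.

    m[k]     -- number of marbles of color k initially in the urn
    omega[k] -- importance weight of color k (larger => drawn more eagerly)
    Returns counts, where counts[k] is how many marbles of color k were drawn.
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = np.asarray(m, dtype=float).copy()
    counts = np.zeros(len(m), dtype=int)
    for _ in range(n_draws):
        # Each remaining marble of color k is drawn with probability
        # proportional to omega[k]; aggregate the weights per color.
        weights = remaining * np.asarray(omega, dtype=float)
        probs = weights / weights.sum()
        k = rng.choice(len(m), p=probs)
        counts[k] += 1
        remaining[k] -= 1
    return counts

# Example: K = 3 groups with m_k = n marbles each, skewed by omega.
n = 6
print(sample_subset_sizes(m=[n, n, n], omega=[1.0, 2.0, 0.5], n_draws=n))
```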
Probability distribution over Permutation Matrices  Let $p(\pi)$ denote a distribution over permutation matrices $\pi \in \{0,1\}^{n \times n}$. A permutation matrix $\pi$ is doubly stochastic (Marcus, 1960), meaning that its row and column vectors sum to 1. This property allows us to use $\pi$ to describe an order over a set of $n$ elements, where $\pi_{ij} = 1$ means that element $j$ is ranked at position $i$ in the imposed order. In this work, we assume $p(\pi)$ to be parameterized by scores $s \in \mathbb{R}^n_+$, where each score $s_i$ corresponds to an element $i$. The order given by sorting $s$ in decreasing order corresponds to the most likely permutation in $p(\pi; s)$. Sampling from $p(\pi; s)$ can be achieved by resampling the scores as $\tilde{s}_i = \beta \log s_i + g_i$, where $g_i \sim \mathrm{Gumbel}(0, \beta)$ for a fixed scale $\beta$, and sorting them in decreasing order. Hence, resampling scores $s$ enables the resampling of permutation matrices $\pi$. The probability over orderings $p(\pi; s)$ is then given by (Thurstone, 1927; Luce, 1959; Plackett, 1975; Yellott, 1977)
$$p(\pi; s) = p\big((\pi s)_1 \geq \dots \geq (\pi s)_n\big) = \frac{(\pi s)_1}{Z} \cdot \frac{(\pi s)_2}{Z - (\pi s)_1} \cdots \frac{(\pi s)_n}{Z - \sum_{j=1}^{n-1} (\pi s)_j}, \tag{3}$$
where $\pi$ is a permutation matrix and $Z = \sum_{i=1}^{n} s_i$. The resulting distribution is a Plackett-Luce (PL) distribution (Luce, 1959; Plackett, 1975) if and only if the scores $s$ are perturbed with noise drawn from Gumbel distributions with identical scales (Yellott, 1977). For more details, we refer to Appendix A.2.

4 A Two-stage Approach to Random Partition Models

We propose the DRPM $p(Y; \omega, s)$, a differentiable and reparameterizable two-stage random partition model (RPM). The proposed formulation separately infers the number of elements per subset, $n \in \mathbb{N}_0^K$ with $\sum_{k=1}^{K} n_k = n$, and the assignment of elements to the subsets $S_k$ by inducing an order on the $n$ elements and filling $S_1, \dots, S_K$ sequentially in this order. To model the order of the elements, we use a permutation matrix $\pi = [\pi_1, \dots, \pi_n]^T \in \{0,1\}^{n \times n}$, from which we infer $Y$ by sequentially summing up rows according to $n$. Note that the doubly stochastic property of all permutation matrices $\pi$ ensures that the columns of $Y$ remain one-hot vectors, assigning every element $i$ to precisely one of the $K$ subsets. At the same time, the $k$-th row of $Y$ corresponds to an $n_k$-hot vector $y_k$ and therefore serves as a subset selection vector, i.e.,
$$y_k = \sum_{i=\nu_k + 1}^{\nu_k + n_k} \pi_i, \quad \text{where } \nu_k = \sum_{j=1}^{k-1} n_j, \tag{4}$$
such that $Y = [y_1, \dots, y_K]^T$. Additionally, Figure 1 provides an illustrative example. Note that $K$ defines the maximum number of possible subsets, and not the effective number of non-empty subsets, because we allow $S_k$ to be the empty set (Mansour and Schork, 2016).

We base the following Proposition 4.1 on the MVHG distribution $p(n; \omega)$ for the subset sizes $n$ and the PL distribution $p(\pi; s)$ for assigning the elements to subsets. However, the proposed two-stage approach to RPMs is not restricted to these two classes of probability distributions.

Proposition 4.1 (Two-stage Random Partition Model). Given a probability distribution over subset sizes $p(n; \omega)$ with $n \in \mathbb{N}_0^K$ and distribution parameters $\omega \in \mathbb{R}_+^K$, and a PL probability distribution over random orderings $p(\pi; s)$ with $\pi \in \{0,1\}^{n \times n}$ and distribution parameters $s \in \mathbb{R}_+^n$, the probability mass function $p(Y; \omega, s)$ of the two-stage RPM is given by
$$p(Y; \omega, s) = p(y_1, \dots, y_K; \omega, s) = p(n; \omega) \sum_{\pi \in \Pi_Y} p(\pi; s), \tag{5}$$
where $\Pi_Y = \{\pi : y_k = \sum_{i=\nu_k+1}^{\nu_k+n_k} \pi_i,\ \forall k = 1, \dots, K\}$, and $y_k$ and $\nu_k$ are as in Equation (4).

In the following, we outline the proof of Proposition 4.1 and refer to Appendix B for a formal derivation. We calculate $p(Y; \omega, s)$ as a probability of subsets $p(y_1, \dots, y_K; \omega, s)$, which we compute sequentially over subsets, i.e., $p(y_1, \dots, y_K; \omega, s) = p(y_1; \omega, s) \cdots p(y_K \mid y$
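As a complement to Equations (4) and (5), the following non-differentiable sketch mirrors the two-stage construction: draw an ordering from p(π; s) by Gumbel-perturbing the log-scores and sorting, then build the assignment matrix Y by summing consecutive rows of π according to the subset sizes n. All names are illustrative, and the hard argsort merely stands in for the differentiable relaxations the DRPM actually uses.

```python
import numpy as np

def sample_partition(n_sizes, scores, beta=1.0, rng=None):
    """Hard (non-relaxed) two-stage partition sample.

    n_sizes -- subset sizes n = (n_1, ..., n_K), e.g. drawn from an MVHG
    scores  -- positive PL scores s, one per element
    Returns the assignment matrix Y in {0,1}^{K x n}.
    """
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(scores, dtype=float)
    n = s.shape[0]

    # Perturb log-scores with Gumbel noise and sort in decreasing order,
    # which samples an ordering from the Plackett-Luce distribution p(pi; s).
    perturbed = beta * np.log(s) + rng.gumbel(loc=0.0, scale=beta, size=n)
    order = np.argsort(-perturbed)

    # Permutation matrix pi: pi[i, j] = 1 iff element j is ranked at position i.
    pi = np.zeros((n, n))
    pi[np.arange(n), order] = 1.0

    # Assignment matrix Y: subset k collects the next n_k rows of pi (Eq. 4).
    Y = np.zeros((len(n_sizes), n))
    nu = 0
    for k, n_k in enumerate(n_sizes):
        Y[k] = pi[nu:nu + n_k].sum(axis=0)
        nu += n_k
    return Y

# Example: n = 6 elements split into subsets of sizes (2, 3, 1).
Y = sample_partition(n_sizes=[2, 3, 1], scores=[1.0, 0.3, 2.0, 0.5, 0.8, 1.5])
print(Y)
assert (Y.sum(axis=0) == 1).all()  # every element lands in exactly one subset
```

Because the rows of π are distinct one-hot vectors and the sizes in n sum to the number of elements, every column of Y is one-hot, matching the validity constraints of Equation (1).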