# Unsupervised Learning under Latent Label Shift

Manley Roberts, Pranav Mani, Saurabh Garg, Zachary C. Lipton
Carnegie Mellon University
{manleyroberts,zlipton}@cmu.edu; {pmani, sgarg2}@cs.cmu.edu

What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(x|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|x)$ suffices to identify $p_d(y)$ and $p_d(y|x)$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through the domain discriminator $p(d|x)$; (ii) discretize the data by clustering examples in $p(d|x)$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|x)$ to compute $p_d(y|x)$ for all $d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when data-space similarity does not indicate true groupings, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work.[2]

1 Introduction

Discovering systems of categories from unlabeled data is a fundamental but ill-posed challenge in machine learning. Typical unsupervised learning methods group instances together based on feature-space similarity. Accordingly, given a collection of photographs of animals, a practitioner might hope that, in some appropriate feature space, images of animals of the same species should be somehow similar to each other. But why should we expect a clustering algorithm to recognize that dogs viewed in sunlight and dogs viewed at night belong to the same category? Why should we expect that butterflies and caterpillars should lie close together in feature space?

In this paper, we offer an alternative principle according to which we might identify a set of classes: we exploit distribution shift across times and locations to reveal otherwise unrecognizable groupings among examples. For example, if we noticed that whenever we found ourselves in a location where butterflies are abundant, caterpillars were similarly abundant, and that whenever butterflies were scarce, caterpillars had a similar drop in prevalence, we might conclude that the two were tied to the same underlying concept, no matter how different they appear in feature space. In short, our principle suggests that latent classes might be uncovered whenever instances that shift together group together.

Equal contribution.
[2] Code is available at https://github.com/acmi-lab/Latent-Label-Shift-DDFA.
Figure 1: Schematic of our DDFA algorithm. After training a domain discriminator, we (i) push all data through the discriminator; (ii) cluster the data based on discriminator outputs; (iii) solve the resulting discrete topic modeling problem and then combine $\hat{q}(d|x)$ and $\hat{q}(y, d)$ to estimate $\hat{p}_d(y|x)$.

Formalizing this intuition, we introduce the problem of unsupervised learning under Latent Label Shift (LLS). Here, we assume access to a collection of domains $d \in \{1, \dots, r\}$, where the mixture proportions $p_d(y)$ vary across domains but the class-conditional distribution $p(x|y)$ is domain-invariant. Our goals are to recover the underlying classes up to permutation, and thus to identify both the per-domain mixture proportions $p_d(y)$ and optimally adapted per-domain classifiers $p_d(y|x)$. The essential feature of our setup is that only the true $y$'s, as characterized by their class-conditional distributions $p(x|y)$, could account for the observed shifts in $p_d(x)$. We prove that under mild assumptions, knowledge of this underlying structure is sufficient for inducing the full set of categories.

First, we focus on the tabular setting, demonstrating that when the input space is discrete and finite, LLS is isomorphic to topic modeling [9]. Here, each distinct input $x$ maps to a word, each latent label $y$ maps to a topic, and each domain $d$ maps to a document. In this case, we can apply standard identification results for topic modeling [22, 5, 30, 38, 14] that rely only on the existence of anchor words within each topic (i.e., for each label $y$ there is at least one $x$ in the support of $y$ that is not in the support of any $y' \neq y$). Here, standard methods based on Non-negative Matrix Factorization (NMF) can recover each domain's underlying mixture proportion $p_d(y)$ and optimal predictor $p_d(y|x)$ [22, 38, 30]. However, the restriction to discrete inputs, while appropriate for topic modeling, proves restrictive when our interests extend to high-dimensional continuous input spaces.

Then, to handle high-dimensional inputs, we propose Discriminate-Discretize-Factorize-Adjust (DDFA), a general framework that proceeds in the following steps: (i) pool data from all domains to produce a mixture distribution $q(x, d)$; (ii) train a domain-discriminative model $f$ to predict $q(d|x)$; (iii) push all data through $f$, cluster examples in the pushforward distribution, and tabularize the data based on cluster membership; (iv) solve the resulting discrete topic modeling problem (e.g., via NMF), estimating $q(y, d)$ up to permutation of the latent labels; (v) combine the predicted $q(d|x)$ and $q(y, d)$ to estimate $p_d(y)$ and $p_d(y|x)$. In developing this approach, we draw inspiration from recent works on distribution shift and learning from positive and unlabeled data that (i) leverage black-box predictors to perform dimensionality reduction [46, 25, 26]; and (ii) work with anchor sets, separable subsets of continuous input spaces that belong to only one class's support [65, 49, 23, 7, 26]. Our key theoretical result shows that domain discrimination ($q(d|x)$) provides a sufficient representation for identifying all parameters of interest. Given oracle access to $q(d|x)$ (which is identified without labels), our procedure is consistent. Our analysis reveals that the true $q(d|x)$ maps all points in the same anchor set to a single point mass in the push-forward distribution.
This motivates our practical approach of discretizing data by hunting for tight clusters in $q(d|x)$ space. In semi-synthetic experiments, we adapt existing image classification benchmarks to the LLS setting, sampling without replacement to construct collections of label-shifted domains. We note that training a domain-discriminator classifier is a difficult task, and find that warm-starting the initial layers of our model with pretrained weights from unsupervised approaches can significantly boost performance. We show that warm-started DDFA outperforms competitive unsupervised approaches on CIFAR-10 and CIFAR-20 when the domain label marginals $p_d(y)$ are sufficiently sparse. In particular, we observe improvements of as much as 30% accuracy over a recent high-performing unsupervised method on CIFAR-20. Further, on subsets of the FieldGuide dataset, where similarity between species and diversity within a species lead to the failure of unsupervised learning, we show that DDFA recovers the true distinctions. To be clear, these are not apples-to-apples comparisons: our methods are specifically tailored to the LLS setting. The takeaway is that the structure of the LLS setting can be exploited to outperform the best unsupervised learning heuristics.

2 Related Work

Unsupervised Learning. Standard unsupervised learning approaches for discovering labels often rely on similarity in the original data space [50, 62]. While distances in feature space become meaningless for high-dimensional data, deep learning researchers have turned to similarity in a representation space learned via self-supervised contrastive tasks [53, 21, 29, 12], or similarity in a feature space learned end-to-end for a clustering task [10, 11, 57, 72]. Our problem setup closely resembles independent component analysis (ICA), where one seeks to identify statistically independent signal components from mixtures [39]. However, ICA's assumption of statistical independence among the components does not generally hold in our setup.

In topic modeling [9, 5, 38, 14, 56], documents are modeled as mixtures of topics, and topics as categorical distributions over a finite vocabulary. Early topic models include the well-known Latent Dirichlet Allocation (LDA) [9], which assumes that topic mixing coefficients are drawn from a Dirichlet distribution, along with papers with more relaxed assumptions on the distribution of topic mixing coefficients (e.g., pLSI) [37, 56]. The topic modeling literature often draws on Non-negative Matrix Factorization (NMF) methods [54, 66], which decompose a given matrix into a product of two matrices with non-negative elements [20, 18, 28, 31]. In both topic modeling and NMF, a fundamental problem has been to characterize the precise conditions under which the system is uniquely identifiable [22, 5, 38, 14]. The anchor condition (also referred to as separability) is known to be instrumental for identifying topic models [5, 14, 38, 22]. In this work, we extend these ideas, leveraging separable subsets of each label's support (the anchor sets) to produce anchor words in the discretized problem. Existing methods have attempted to extend latent variable modeling to continuous input domains by making assumptions about the functional forms of the class-conditional densities, e.g., restricting to Gaussian mixtures [62, 60].
Recent work in nonparametric mixture modeling [63, 73] employs the grouping of unlabeled samples to better identify unknown mixtures, but unlike our setting, each group is known to be sampled only from a single mixture component. A second line of work involves finding an appropriate discretization of the continuous space [71].

Distribution Shift under the Label Shift Assumption. The label shift assumption, where $p_d(y)$ can vary but $p(x|y)$ cannot, has been extensively studied in the domain adaptation literature [64, 68, 77, 46, 33, 25] and also features in the problem of learning from positive and unlabeled data [24, 8, 26]. For both problems, many classical approaches suffer from the curse of dimensionality, failing in the settings where deep learning prevails. Our solution strategy draws inspiration from recent work on label shift [46, 1, 6, 25] and PU learning [8, 49, 65, 26, 27] that leverages black-box predictors to produce sufficient low-dimensional representations for identifying target distributions of interest (other works leverage black-box predictors heuristically [41]). Key differences: while PU learning requires identifying one new class for which we lack labeled examples, provided that the positive class contains an anchor set [26], LLS can identify an arbitrary number of classes (up to permutation) from completely unlabeled data, provided a sufficient number of domains.

Domain Generalization. The related problem of Domain Generalization (DG) also addresses learning with data drawn from multiple distributions, where the domain identifiers play a key role [51, 3]. However, in DG, we are given labeled data from multiple domains, and our goal is to learn a classifier that can generalize to new domains. By contrast, in LLS, we work with unlabeled data only, leveraging the problem structure to identify the underlying labels.

Learning with Label Noise (LLN). Another problem type that can be viewed under the LLS lens is Learning with Label Noise (LLN) [58, 32, 45, 55, 61, 70, 47, 48]. This connection can be established as follows. Consider the observed noisy labels to act as domains (see Definition 1). We then have (i) different distributions over the true labels conditioned on the noisy label; and (ii) when conditioned on the true label, the distribution over the input space is independent of the noisy label. These observations indicate that the LLS setup (Definition 1) is satisfied in LLN; a key difference is that the LLN literature typically assumes the distribution over the noisy label places its highest mass at $\hat{y} = i$ when the true label is $y = i$. The connection between mixture proportion estimation and LLN has been identified in [65]. Many methods have been proposed to tackle the LLN problem: (i) loss correction and surrogate risk minimization [58, 32, 65, 52]; (ii) leveraging the predictions of deep networks to assign soft labels, as in teacher-student models or in MixUp-like augmentations [34, 13, 61, 70, 2]; and (iii) tapping into the early learning of meaningful patterns by deep networks [45, 47, 67].

Figure 2: Relationship under $Q$ between observed $D$, observed $X$, and latent $Y$.

Of particular relevance is [58], which estimates the Noise Transition Matrix (NTM) and then uses this estimate to retrain under a corrected loss function. [58] shows that "perfect examples" allow for the identification of the NTM (similar to the anchor subdomain assumption which appears in this work).
However, [58] uses the assumption that when the true label takes a certain value, the highest probability mass for the noisy label is also at the same value, an assumption not always satisfied in our setting. Further, in the LLN problem, the number of domains always equals the number of classes, rendering LLN methods such as [58] incapable of using the information provided by additional domains present in non-LLN settings. Our method can leverage domain information to provide a domain-calibrated classifier, but [58] does not exploit the possible availability of a noisy label at inference. In recent work, [76] use methods from [58] to connect LLN to learning from multiple bags of samples for which the label proportions are known (similar to our setting, but with more information available in the form of these label proportions).

3 Latent Label Shift Setup

Notation. For a vector $v \in \mathbb{R}^p$, we use $v_j$ to denote its $j$th entry, and for an event $E$, we let $I[E]$ denote the binary indicator of the event. By $|A|$, we denote the cardinality of a set $A$. With $[n]$, we denote the set $\{1, 2, \dots, n\}$. We use $[A]_{i,j}$ to access the element at $(i, j)$ in $A$. Let $\mathcal{X}$ be the input space and $\mathcal{Y} = \{1, 2, \dots, k\}$ be the output space for multiclass classification. We assume throughout this work that the number of true classes $k$ is known. Throughout this paper, we use capital letters to denote random variables and lowercase letters to denote the corresponding values they take. For example, by $X$ we denote the input random variable, and by $x$ we denote a value that $X$ may take.

We now formally introduce the problem of unsupervised learning under LLS. In LLS, we assume that we observe unlabeled data from $r$ domains. Let $\mathcal{R} = \{1, 2, \dots, r\}$ be the set of domains. By $p_d$, we denote the probability density (or mass) function for each domain $d \in \mathcal{R}$.

Definition 1 (Latent label shift). We observe data from $r$ domains. While the label distribution can differ across the domains, for all $d, d' \in \mathcal{R}$ and for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, we have $p_d(x|y) = p_{d'}(x|y)$.

Simply put, Definition 1 states that the conditional distribution $p_d(x|y)$ remains invariant across domains, i.e., the domains satisfy the label shift assumption. Thus, we can drop the subscript on this factor, denoting all $p_d(x|y)$ by $p(x|y)$. Crucially, under LLS, $p_d(y)$ can vary across different domains. Under LLS, we observe unlabeled data with domain labels $\{(x_1, d_1), (x_2, d_2), \dots, (x_n, d_n)\}$. Our goal breaks down into two tasks. Up to permutation of labels, we aim to (i) estimate the label marginal $p_d(y)$ in each domain; and (ii) estimate the optimal per-domain predictor $p_d(y|x)$.

Mixing distribution $Q$. A key step in our algorithm will be to train a domain-discriminative model. Towards this end, we define $Q$, a distribution over $\mathcal{X} \times \mathcal{Y} \times \mathcal{R}$, constructed by taking a uniform mixture over all domains. By $q$, we denote the probability density (or mass) function of $Q$. Define $Q$ such that $q(x, y \mid D = d) = p_d(x, y)$, i.e., when we condition on $D = d$ we recover the joint distribution over $\mathcal{X} \times \mathcal{Y}$ specific to that domain $d$. For all $d \in \mathcal{R}$, we define $\gamma_d = q(d)$, i.e., the prevalence of each domain in our distribution $Q$. Notice that $q(x, y)$ is a mixture over the distributions $\{p_d(x, y)\}_{d \in \mathcal{R}}$, with $\{\gamma_d\}_{d \in \mathcal{R}}$ as the corresponding mixture coefficients. Under LLS (Definition 1), $X$ does not depend on $D$ when conditioned on $Y$ (Fig. 2).

Additional notation for the discrete case. To begin, we set up notation for discrete input spaces with $|\mathcal{X}| = m$. Without loss of generality, we assume that $\mathcal{X} = \{1, 2, \dots, m\}$.
The label shift assumption allows us to formulate the label marginal estimation problem in matrix form. Let $Q_{X|D}$ be an $m \times r$ matrix such that $[Q_{X|D}]_{i,d} = p_d(X = i)$, i.e., the $d$-th column of $Q_{X|D}$ is $p_d(x)$. Let $Q_{X|Y}$ be an $m \times k$ matrix such that $[Q_{X|Y}]_{i,j} = p(X = i \mid Y = j)$, i.e., the $j$-th column is a distribution over $\mathcal{X}$ given $Y = j$. Similarly, define $Q_{Y|D}$ as a $k \times r$ matrix whose $d$-th column is the label marginal $p_d(y)$. Now, with Definition 1, we have

$$p_d(x) = \sum_y p_d(x, y) = \sum_y p_d(x|y)\, p_d(y) = \sum_y p(x|y)\, p_d(y).$$

Since this holds for all $d \in \mathcal{R}$, we can express it in matrix form as $Q_{X|D} = Q_{X|Y} Q_{Y|D}$.

Additional assumptions. Before we present identifiability results for the LLS problem, we introduce four additional assumptions required throughout the paper:

A.1 There are at least as many domains as classes, i.e., $|\mathcal{R}| \ge |\mathcal{Y}|$.

A.2 The matrix formed by the label marginals (as columns) across different domains is full rank, i.e., $\mathrm{rank}(Q_{Y|D}) = k$.

A.3 Equal representation of domains, i.e., for all $d \in \mathcal{R}$, $\gamma_d = 1/r$.

A.4 Fix $\epsilon > 0$. For all $y \in \mathcal{Y}$, there exists a unique subdomain $A_y \subseteq \mathcal{X}$ such that $q(A_y) \ge \epsilon$ and $x \in A_y$ if and only if the following conditions are satisfied: $q(x|y) > 0$ and $q(x|y') = 0$ for all $y' \in \mathcal{Y} \setminus \{y\}$. We refer to this assumption as the $\epsilon$-anchor sub-domain condition.

We now comment on the assumptions. A.1 and A.2 are benign; these assumptions just imply that the matrix $Q_{Y|D}$ has full row rank. Without loss of generality, A.3 can be assumed when dealing with data from a collection of domains; when this condition is not satisfied, one could simply re-sample data points uniformly at random from each domain $d$. Intuitively, A.4 states that for each label $y \in \mathcal{Y}$, we have some subset of inputs that belong only to that class $y$. To avoid vanishing probability of this subset, we require at least $\epsilon$ probability mass under our mixing distribution $Q$. The anchor word condition is related to the positive sub-domain in PU learning, which requires that there exists a subset of $\mathcal{X}$ in which all examples belong only to the positive class [65, 49, 23, 7].

4 Theoretical Analysis

In this section, we establish identifiability of the LLS problem. We begin by considering the case where the input space is discrete and formalize the isomorphism to topic modeling. Then we establish the identifiability of the system in this discrete setting by appealing to existing results in topic modeling [38]. Finally, extending results from the discrete case, we provide a novel analysis to establish our identifiability result for the continuous setting.

Isomorphism to topic modeling. Recall that for the discrete input setting, we have the matrix formulation $Q_{X|D} = Q_{X|Y} Q_{Y|D}$. Consider a corpus of $r$ documents, consisting of terms from a vocabulary of size $m$. Let $D$ be an $\mathbb{R}^{m \times r}$ matrix representing the underlying corpus. Each column of $D$ represents a document, and each row represents a term in the vocabulary. Each element $[D]_{i,j}$ represents the frequency of term $i$ in document $j$. Topic modeling [9, 37, 38, 5] considers each document to be composed as a mixture of $k$ topics. Each topic prescribes a frequency with which the terms in the vocabulary occur given that topic. Further, the proportion of each topic varies across documents, while the frequency of terms given a topic remains invariant. We can state the topic modeling problem as $D = CW$, where $C$ is an $\mathbb{R}^{m \times k}$ matrix in which $[C]_{i,j}$ represents the frequency of term $i$ given topic $j$, and $W$ is an $\mathbb{R}^{k \times r}$ matrix in which $[W]_{i,j}$ represents the proportion of topic $i$ in document $j$. Note that all three matrices are column-normalized.
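To make the shared structure concrete, the following is a minimal sketch (our own illustration, not taken from the paper's experiments): we synthesize a column-stochastic $Q_{X|Y}$ with one anchor word per label and a Dirichlet-generated $Q_{Y|D}$, form $Q_{X|D} = Q_{X|Y} Q_{Y|D}$, and try to recover the factors with scikit-learn's generic NMF solver as a heuristic stand-in for the separability-based algorithms cited above. The renormalization and Hungarian matching mirror the procedure described in Sec. 6; all sizes and solver settings here are arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
m, k, r = 30, 3, 6      # vocabulary size (inputs), topics (labels), documents (domains)

# Q_{X|Y}: column-stochastic, with one anchor word per label (words 0..k-1).
Q_XY = rng.random((m, k))
Q_XY[:k, :] = 0.0
Q_XY[np.arange(k), np.arange(k)] = 1.0           # word i has support only under label i
Q_XY /= Q_XY.sum(axis=0, keepdims=True)

# Q_{Y|D}: column-stochastic label marginals, one column per domain (full rank w.h.p.).
Q_YD = rng.dirichlet(np.ones(k) * 0.5, size=r).T

Q_XD = Q_XY @ Q_YD                               # the observable matrix Q_{X|D}

# Generic NMF does not enforce column-stochastic factors, so we renormalize afterwards,
# pushing the column scales of W into the corresponding rows of H.
nmf = NMF(n_components=k, init="nndsvda", max_iter=5000, random_state=0)
W = nmf.fit_transform(Q_XD)                      # estimate of Q_{X|Y}, up to scale and permutation
H = nmf.components_                              # estimate of Q_{Y|D}, up to scale and permutation
scale = W.sum(axis=0)
W_hat, H_hat = W / scale, H * scale[:, None]
H_hat /= H_hat.sum(axis=0, keepdims=True)

# Resolve the label permutation (Hungarian matching, as in Sec. 6) and report the error.
_, perm = linear_sum_assignment(-(Q_XY.T @ W_hat))
print("max |Q_YD - H_hat| after matching:", np.abs(Q_YD - H_hat[perm]).max())
```

Because generic NMF minimizes a Frobenius objective rather than exploiting separability, the recovered factors are only approximate; the separability-based algorithms cited above come with recovery guarantees under the anchor word condition (formalized for LLS in Theorem 1 below).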
The isomorphism is then between document and domain, topic and label, and term and input sample, i.e., $D = CW$ corresponds to $Q_{X|D} = Q_{X|Y} Q_{Y|D}$. In both cases, we are interested in decomposing a known matrix into two unknown matrices. This formulation is examined as a non-negative matrix factorization problem with an added simplicial constraint on the columns (columns sum to 1) [4, 30]. Identifiability of the topic modeling problem is well established [22, 5, 30, 38, 14]. We leverage the isomorphism to topic modeling to extend this identifiability condition to our LLS setting. We formalize the adaptation here:

Theorem 1 (adapted from Proposition 1 in Huang et al. [38]). Assume A.1, A.2, and A.4 hold (A.4 in the discrete setting is referred to as the anchor word condition). Then the solution to $Q_{X|D} = Q_{X|Y} Q_{Y|D}$ is uniquely identified up to permutation of class labels.

We refer readers to Huang et al. [38] for a proof of this theorem. Intuitively, Theorem 1 states that if each label $y$ has at least one token in the input space that has support only under $y$, and A.1 and A.2 hold, then the solution $(Q_{X|Y}, Q_{Y|D})$ is unique. Furthermore, under this condition, there exist algorithms that can recover $Q_{X|Y}$ and $Q_{Y|D}$ within some permutation of class labels [38, 30, 4, 5].

Extensions to the continuous case. We will prove identifiability in the continuous setting, where $\mathcal{X} = \mathbb{R}^p$ for some $p \ge 1$. In addition to A.1-A.4, we make an additional assumption that we have oracle access to $q(d|x)$, i.e., the true domain discriminator for the mixture distribution $Q$. This is implied by assuming access to the marginal $q(x, d)$ from which we observe our samples. Formally, we define a push-forward function $f$ such that $[f(x)]_d = q(d|x)$, then push the data forward through $f$ to obtain outputs in the simplex $\Delta^{r-1}$. In the proof of Theorem 2, we will show that these outputs can be discretized in a fashion that maps anchor subdomains to anchor words in a tabular, discrete setting. We separately remark that the anchor word outputs are in fact extreme corners of the convex polytope in $\Delta^{r-1}$ that encloses all $f(x)$ mass; we discuss this geometry further in App. F. After constructing the anchor word discretization, we appeal to Theorem 1 to recover $Q_{Y|D}$. Given $Q_{Y|D}$, we show that we can use Bayes rule and the LLS condition (Definition 1) to identify the distribution $q(y \mid x, d) = p_d(y|x)$ over the latent variable $y$. We formalize this in the following theorem:

Theorem 2. Let the distribution $Q$ over random variables $X, Y, D$ satisfy Assumptions A.1-A.4. Assuming access to the joint distribution $q(x, d)$ and knowledge of the number of true classes $k$, the following quantities are identifiable: (i) $Q_{Y|D}$; (ii) $q(y \mid X = x)$ for all $x \in \mathcal{X}$ that lie in the support (i.e., $q(x) > 0$); and (iii) $q(y \mid X = x, D = d)$ for all $x \in \mathcal{X}$ and $d \in \mathcal{R}$ such that $q(x, d) > 0$.

Before presenting a proof sketch for Theorem 2, we first present key lemmas (we include their proofs in App. A).

Lemma 1. Under the same assumptions as Theorem 2, the matrix $Q_{Y|D}$ and $f(x)$ (where $[f(x)]_d = q(d|x)$) uniquely determine $q(y|x)$ for all $y \in \mathcal{Y}$ and $x \in \mathcal{X}$ such that $q(x) > 0$.

Lemma 1 states that given the matrix $Q_{Y|D}$ and the oracle domain discriminator, we can uniquely identify $q(y|x)$. In particular, we show that for any $x \in \mathcal{X}$, $q(d|x)$ can be expressed as a convex combination of the $k$ columns of $Q_{D|Y}$ (which is computed from $Q_{Y|D}$ and has column rank $k$), with the coefficients of the combination being $q(y|x)$. Combining this with the linear independence of the columns of $Q_{D|Y}$, we show that these coefficients are unique.
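As a small numerical check of Lemma 1 (our own sketch under A.1-A.3, with a synthetic $Q_{Y|D}$), the snippet below forms $Q_{D|Y}$ from $Q_{Y|D}$ by Bayes rule with a uniform domain prior, builds the oracle discriminator output $[f(x)]_d = \sum_y q(d|y)\, q(y|x)$, and recovers the unique mixture coefficients $q(y|x)$ with the pseudo-inverse of $Q_{D|Y}$ (the same operation used later in Algorithm 2).

```python
import numpy as np

rng = np.random.default_rng(1)
k, r = 4, 6                                        # classes and domains, r >= k (A.1)

Q_YD = rng.dirichlet(np.ones(k) * 0.3, size=r).T   # k x r; column d is p_d(y), full rank w.h.p. (A.2)
gamma = np.full(r, 1.0 / r)                        # uniform domain prior (A.3)

# Bayes rule: q(d|y) = q(y|d) * gamma_d / q(y).
q_y = Q_YD @ gamma                                 # label marginal q(y) under the mixture Q
Q_DY = (Q_YD * gamma).T / q_y                      # r x k; column y is the distribution q(.|y)

q_y_given_x = rng.dirichlet(np.ones(k))            # some ground-truth posterior q(y|x)
f_x = Q_DY @ q_y_given_x                           # oracle output: q(d|x) = sum_y q(d|y) q(y|x)

# Lemma 1: the k columns of Q_{D|Y} are linearly independent, so the coefficients
# expressing f(x) as their convex combination are unique and equal q(y|x).
recovered = np.linalg.pinv(Q_DY) @ f_x
print(np.allclose(recovered, q_y_given_x))         # True
```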
In the following lemma, we show how the identified $q(y|x)$ can then be used to identify $q(y \mid x, d)$:

Lemma 2. Under the same assumptions as Theorem 2, for all $y \in \mathcal{Y}$, $d \in \mathcal{R}$, and $x \in \mathcal{X}$ such that $q(x, d) > 0$, the matrix $Q_{Y|D}$ and $q(y|x)$ uniquely determine $q(y \mid x, d)$.

To prove Lemma 2, we show that we can combine the conditional distribution over the labels given a sample $x \in \mathcal{X}$ with the prior distribution of the labels in each domain to determine the posterior distribution over labels given the sample $x$ and the domain of interest. Next, we introduce a key property of the domain discriminator $f$:

Lemma 3. Under the same assumptions as Theorem 2, for all $x, x'$ in the same anchor sub-domain, i.e., $x, x' \in A_y$ for a given label $y \in \mathcal{Y}$, we have $f(x) = f(x')$. Further, for any $y \in \mathcal{Y}$, if $x \in A_y$ and $x' \notin A_y$, then $f(x) \neq f(x')$.

Lemma 3 implies that the oracle domain discriminator $f$ maps all points in an anchor subdomain, and only those points in that anchor subdomain, to the same point in $f(x) = q(\cdot|x)$ space. We can now present a proof sketch for Theorem 2 (full proof in App. B):

Proof sketch of Theorem 2. The key idea of the proof lies in proposing a discretization such that some subset of anchor subdomains for each label $y$ in the continuous space maps to distinct anchor words in the discrete space. In particular, if there exists a discretization of the continuous space $\mathcal{X}$ that, for any $y \in \mathcal{Y}$, maps all $x \in A_y$ to the same point in the discrete space, but no $x \notin A_y$ maps to this point, then this point serves as an anchor word. From Lemma 3, we know that all the $x \in A_y$, and only the $x \in A_y$, get mapped to specific points in the $f(x)$ space. Pushing all the $x \in \mathcal{X}$ through $f$, we know from A.4 that there exist $k$ point masses of size at least $\epsilon$, one for each $f(A_y)$, in the $f(x) = q(\cdot|x)$ space. We can now inspect this space for point masses of size at least $\epsilon$ to find at most $O(1/\epsilon)$ such point masses, among which are contained the $k$ point masses corresponding to the anchor subdomains. Discretizing this space by assigning each point mass to a group (and all non-point masses to a single additional group), we have $k$ groups that each have support in only one $y$. Thus, we have achieved a discretization with anchor words. Further, since the discrete space arises from a pushforward of the continuous space through $f$, the discrete space also satisfies the latent label shift assumption. We now use Theorem 1 to claim identifiability of $Q_{Y|D}$. We then use Lemmas 1 and 2 to prove parts (ii) and (iii).

5 DDFA Framework

Motivated by our identifiability analysis, in this section we present an algorithm to estimate $Q_{Y|D}$, $q(y|x)$, and $q(y \mid x, d)$ when $\mathcal{X}$ is continuous, by exploiting domain structure and approximating the true domain discriminator $f$. Intuitively, $q(y \mid x, d)$ is the domain-specific classifier $p_d(y|x)$, and $q(y|x)$ is the classifier for data from aggregated domains. $Q_{Y|D}$ captures the label marginal for individual domains. A naive approach would be to aggregate data from different domains and exploit recent advancements in unsupervised learning [72, 57, 10, 11]. However, aggregating data from multiple domains loses the domain structure that we hope to leverage. We highlight this failure mode of traditional unsupervised clustering methods in Sec. 6. We remark that DDFA draws heavy inspiration from the proof of Theorem 2, but we do not present a guarantee that the DDFA solution will converge to the identifiable solution.
This is primarily due to the K-means clustering heuristic we rely on, which empirically offers effective noise tolerance but theoretically has no guarantee of correct convergence.

Discriminate. We begin Algorithm 1 by splitting the unlabeled samples into training and validation sets. Using the unlabeled data samples and the domain that each sample originated from, we first train a domain discriminator $\hat{f}$. The domain discriminator outputs a distribution over domains for a given input. This classifier is trained with the cross-entropy loss to predict the domain label of each sample on the training set. With unlimited data, the minimizer of this loss is the true $f$, as we prove in App. C. To avoid overfitting, we stop training $\hat{f}$ when the cross-entropy loss on the validation set stops decreasing. Note that here the validation set also contains only domain indices (and no information about true class labels).

Discretize. We now push all the samples from the training and validation sets forward through the domain discriminator to get a vector $\hat{f}(x_i)$ for each sample $x_i$. In the proof of Theorem 2, we argue that when working with the true $f$ and the entire marginal $q(x, d)$, we can choose a discretization satisfying the anchor word assumption by identifying point masses in the distribution of $f(x)$ and filtering to include those of at least $\epsilon$ mass. In the practical setting, because we have only a finite set of data points and a noisy $\hat{f}$, we use clustering to approximately find point masses. We choose $m \ge k$ and recover $m$ clusters with any standard clustering procedure (e.g., K-means). This clustering procedure is effectively a useful but imperfect heuristic: if the noise in $\hat{f}$ is sufficiently small and the clustering sufficiently granular, we hope that our $m$ discovered clusters will include $k$ pure clusters, each of which contains only data points from a different anchor subdomain, tightly packed around the true $f(A_y)$ for the corresponding label $y$. Clustering in this space is superior to naive clustering on the input space because close proximity in this space indicates similarity in $q(d|x)$. Let us denote the learned clustering function as $c$, where $c(x)$ is the cluster assigned to a datapoint $x$. We now leverage the cluster id $c(x_i)$ of each sample $x_i$ to discretize samples into a finite discrete space $[m]$. Combining the cluster id with the domain source $d_i$ for each sample, we estimate $\hat{Q}_{c(X)|D}$ by simply computing, for each domain, the fraction of its samples assigned to each cluster.

Factorize. We apply an NMF algorithm to $\hat{Q}_{c(X)|D}$ to obtain estimates of $\hat{Q}_{c(X)|Y}$ and $\hat{Q}_{Y|D}$.

Adjust. We begin Algorithm 2 by considering a test point $(x', d')$. To make a prediction, if we had access to the oracle $f$ and the true $Q_{Y|D}$, we could precisely compute $q(y|x')$ (Lemma 1). However, in place of these true quantities, we plug in the estimates $\hat{f}$ and $\hat{Q}_{Y|D}$. Since our estimates contain noise, the estimate $\hat{q}(y|x')$ is found by left-multiplying $\hat{f}(x')$ by the pseudo-inverse of $\hat{Q}_{D|Y}$, as opposed to solving a sufficient system of equations. As our estimates $\hat{f}$ and $\hat{Q}_{D|Y}$ approach the true values, the projection of $\hat{f}(x')$ onto the column space of $\hat{Q}_{D|Y}$ tends to $\hat{f}(x')$ itself, so the pseudo-inverse approaches the true solution. Now we can use the constructive procedure introduced in the proof of Lemma 2 to compute the plug-in estimate $\hat{q}(y \mid x', d') = \hat{p}_{d'}(y|x')$.

6 Experiments

Experiment code is available at https://github.com/acmi-lab/Latent-Label-Shift-DDFA.
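For concreteness, here is a minimal sketch of the post-discriminator steps of DDFA (Discretize, Factorize, Adjust; cf. Algorithms 1 and 2 below). This is our own illustration rather than the released implementation: the helper name ddfa_post_discriminator is hypothetical, the sketch assumes the discriminator outputs F have already been computed, and the clustering/NMF settings and the clipping of pseudo-inverse residuals are practical choices of ours, not specified in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

def ddfa_post_discriminator(F, d, k, m):
    """Discretize / Factorize / Adjust, given domain-discriminator outputs.

    F : (n, r) array; row i approximates q(. | x_i) from a trained discriminator.
    d : (n,) integer array of domain indices in {0, ..., r-1}.
    k : number of latent classes; m >= k is the number of clusters.
    Returns (Q_YD_hat, P) where P[i, y] approximates p_{d_i}(y | x_i), up to label permutation.
    """
    n, r = F.shape

    # Discretize: cluster the pushforward points f(x); anchor sub-domains should
    # concentrate near distinct point masses in this space (Lemma 3).
    clusters = KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(F)

    # Empirical Q_{c(X)|D}: per domain, the fraction of samples landing in each cluster.
    Q_cD = np.zeros((m, r))
    np.add.at(Q_cD, (clusters, d), 1.0)
    Q_cD /= np.maximum(Q_cD.sum(axis=0, keepdims=True), 1e-12)

    # Factorize: generic NMF, then renormalize both factors to be column-stochastic.
    nmf = NMF(n_components=k, init="nndsvda", max_iter=2000, random_state=0)
    W = nmf.fit_transform(Q_cD)                                  # ~ Q_{c(X)|Y}
    H = nmf.components_                                          # ~ Q_{Y|D}
    scale = np.maximum(W.sum(axis=0), 1e-12)
    Q_YD_hat = H * scale[:, None]
    Q_YD_hat /= np.maximum(Q_YD_hat.sum(axis=0, keepdims=True), 1e-12)

    # Adjust: form Q_{D|Y} from Q_{Y|D} under the uniform domain prior (A.3),
    # then apply Lemma 1 (pseudo-inverse) and Lemma 2 (per-domain reweighting).
    Q_DY_hat = (Q_YD_hat / np.maximum(Q_YD_hat.sum(axis=1, keepdims=True), 1e-12)).T
    q_y_x = np.clip(F @ np.linalg.pinv(Q_DY_hat).T, 0.0, None)   # ~ q(y | x_i); clipping is our choice
    unnorm = q_y_x * Q_DY_hat[d, :]                              # ~ q(d_i | y) q(y | x_i)
    P = unnorm / np.maximum(unnorm.sum(axis=1, keepdims=True), 1e-12)
    return Q_YD_hat, P
```

In the experiments below, the discriminator is a ResNet trained with cross-entropy and early stopping, and the clustering uses Faiss K-means; the scikit-learn components above are stand-ins used only to illustrate the flow of Algorithms 1 and 2.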
Algorithm 1: DDFA Training
Input: $k \ge 1$, $r \ge k$, samples $\{(x_i, d_i)\}_{i \in [n]} \sim q(x, d)$, a class of functions $\mathcal{F}$ from $\mathbb{R}^p$ to $\mathbb{R}^r$.
1: Split the samples into a train set T and a validation set V.
2: Train $\hat{f} \in \mathcal{F}$ to minimize the cross-entropy loss for predicting $d \mid x$ on T, with early stopping on V.
3: Push all $\{x_i\}_{i \in [n]}$ through $\hat{f}$.
4: Train a clustering algorithm on the $n$ points $\{\hat{f}(x_i)\}_{i \in [n]}$ to obtain $m$ clusters.
5: $c(x_i) \leftarrow$ cluster id of $\hat{f}(x_i)$.
6: $\hat{q}(c(X) = a \mid D = b) \leftarrow \sum_{i \in [n]} I[c(x_i) = a, d_i = b] \,/\, \sum_{j \in [n]} I[d_j = b]$.
7: Populate $\hat{Q}_{c(X)|D}$ as $[\hat{Q}_{c(X)|D}]_{a,b} \leftarrow \hat{q}(c(X) = a \mid D = b)$.
8: $\hat{Q}_{c(X)|Y}, \hat{Q}_{Y|D} \leftarrow \mathrm{NMF}(\hat{Q}_{c(X)|D})$.
Output: $\hat{Q}_{Y|D}$, $\hat{f}$.

Algorithm 2: DDFA Prediction
Input: $\hat{Q}_{Y|D}$, $\hat{f}$, test point $(x', d') \sim q(x, d)$.
1: Populate $\hat{Q}_{D|Y}$ as $[\hat{Q}_{D|Y}]_{d,y} \leftarrow [\hat{Q}_{Y|D}]_{y,d} \,/\, \sum_{d''=1}^{r} [\hat{Q}_{Y|D}]_{y,d''}$.
2: Assign $\hat{q}(y \mid X = x') \leftarrow [\hat{Q}_{D|Y}^{\dagger} \hat{f}(x')]_y$.
3: Assign $\hat{q}(y \mid X = x', D = d') \leftarrow [\hat{Q}_{D|Y}]_{d',y}\, \hat{q}(y \mid X = x') \,/\, \sum_{y'' \in [k]} [\hat{Q}_{D|Y}]_{d',y''}\, \hat{q}(y'' \mid X = x')$.
4: $y_{\mathrm{pred}} \leftarrow \arg\max_{y \in [k]} \hat{q}(y \mid X = x', D = d')$.
Output: $\hat{q}(y \mid X = x', D = d') = \hat{p}_{d'}(y \mid x')$, $\hat{q}(y \mid X = x')$, $y_{\mathrm{pred}}$.

Baselines. We select the unsupervised classification method SCAN as a high-performing, competitive baseline [72]. SCAN pretrains a ResNet [35] backbone using the SimCLR [12] and MoCo [36] setups (pretext tasks). SCAN then trains a clustering head to minimize the SCAN loss (refer to [72] for more details).[3] We do not use the SCAN self-labeling step. We make sure to evaluate SCAN on the same, potentially class-imbalanced, test subset we create for each experiment. Since SCAN is fit on a superset of the data DDFA sees, we believe this gives a slight data advantage to the SCAN baseline (although we acknowledge that the class balance for SCAN training is also potentially different from its evaluation class balance). To evaluate SCAN, we use the public pretrained weights available for CIFAR-10, CIFAR-20, and ImageNet-50. We also train SCAN ourselves on the train and validation portions of the FieldGuide-2 and FieldGuide-28 datasets with a ResNet-18 backbone and SimCLR pretext task. We replicate the hyperparameters used for CIFAR training.

Datasets. First, we examine the standard multiclass image datasets CIFAR-10, CIFAR-20 [43], and ImageNet-50 [19], containing images from 10, 20, and 50 classes, respectively. Images in these datasets typically focus on a single large object that dominates the center of the frame, so unsupervised classification methods that respond strongly to similarity in visual space are well suited to recover true classes up to permutation. These datasets are often believed to be separable (i.e., a single true label applies to each image), so every example falls in an anchor subdomain (satisfying A.4). Motivated by the applications of the LLS problem, we also consider the FieldGuide dataset,[4] which contains images of moths and butterflies. The true classes in this dataset are species, but each class contains images taken in immature (caterpillar) and adult stages of life. Based on the intuition that butterflies from a given species look more like butterflies from other species than like caterpillars from their own species, we hypothesize that unsupervised classification will learn incorrect class boundaries that distinguish caterpillars from butterflies, as opposed to recovering the true class boundaries. Due to high visual similarity between members of different classes, this dataset may indeed have slight overlap between classes. However, we hypothesize that the anchor subdomain condition still holds, i.e., there exist some images from each class that could only come from that class.
Additionally, if we have access to data from multiple domains, it is natural to assume that within each domain the relative distribution of caterpillar to adult stages of each species stays relatively constant as compared to the prevalence of different species. We create two subsets of this dataset: FieldGuide-2, with two species, and FieldGuide-28, with 28 species.

[3] SCAN code: https://github.com/wvangansbeke/Unsupervised-Classification
[4] FieldGuide: https://sites.google.com/view/fgvc6/competitions/butterflies-moths-2019

Table 1: Results on CIFAR-20. Each entry is the average over 3 different random seeds. DDFA (RI) refers to DDFA with a randomly initialized backbone; DDFA (SI) refers to DDFA with its backbone initialized from SCAN (note that in DDFA (SI), we do not leverage SCAN for clustering). $\alpha$ is the Dirichlet parameter used for generating label marginals in each domain, $\kappa$ is the maximum allowed condition number of the generated $Q_{Y|D}$ matrix, and $r$ is the number of domains.

| $r$ | Approach | Test acc ($\alpha$=0.5, $\kappa$=8) | $Q_{Y|D}$ err ($\alpha$=0.5, $\kappa$=8) | Test acc ($\alpha$=3, $\kappa$=12) | $Q_{Y|D}$ err ($\alpha$=3, $\kappa$=12) | Test acc ($\alpha$=10, $\kappa$=20) | $Q_{Y|D}$ err ($\alpha$=10, $\kappa$=20) |
|----|-----------|-------|-------|-------|-------|-------|-------|
| 20 | SCAN      | 0.454 | 0.059 | 0.421 | 0.051 | 0.436 | 0.038 |
| 20 | DDFA (RI) | 0.520 | 0.041 | 0.357 | 0.043 | 0.187 | 0.051 |
| 20 | DDFA (SI) | 0.852 | 0.015 | 0.548 | 0.026 | 0.354 | 0.036 |
| 25 | SCAN      | 0.458 | 0.059 | 0.455 | 0.048 | 0.440 | 0.037 |
| 25 | DDFA (RI) | 0.525 | 0.042 | 0.310 | 0.051 | 0.182 | 0.053 |
| 25 | DDFA (SI) | 0.819 | 0.021 | 0.707 | 0.022 | 0.502 | 0.030 |
| 30 | SCAN      | 0.456 | 0.059 | 0.441 | 0.050 | 0.437 | 0.037 |
| 30 | DDFA (RI) | 0.506 | 0.045 | 0.256 | 0.058 | 0.088 | 0.076 |
| 30 | DDFA (SI) | 0.845 | 0.020 | 0.688 | 0.027 | 0.531 | 0.029 |

LLS Setup. The full sampling procedure for the semi-synthetic experiments is described in App. D. Roughly, we sample $p_d(y)$ from a symmetric Dirichlet distribution with concentration $\alpha/k$, where $k$ is the number of classes and $\alpha$ is a generation parameter that adjusts the difficulty of the synthetic problem, and we enforce a maximum condition number $\kappa$ on $Q_{Y|D}$. Small $\alpha$ and small $\kappa$ encourage sparsity in $Q_{Y|D}$, so each label tends to appear in only a few domains. Larger parameters encourage $p_d(y)$ to tend toward uniform. We draw from the test, train, and validation datasets without replacement to match these distributions, but discard some examples due to class imbalance.

Training and Evaluation. The algorithm uses train and validation data consisting of pairs of images and domain indices. We train a ResNet-50 [35] (with added dropout) on images $x_i$ with domain indices $d_i$ as the label, choose the best iteration by validation loss, pass all training and validation data through $\hat{f}$, and cluster the pushforward predictions $\hat{f}(x_i)$ into $m \ge k$ clusters with Faiss K-means [42]. We compute the $\hat{Q}_{c(X)|D}$ matrix and run NMF to obtain $\hat{Q}_{c(X)|Y}$ and $\hat{Q}_{Y|D}$. To make columns sum to 1, we normalize the columns of $\hat{Q}_{c(X)|Y}$, multiply each column's normalization coefficient over the corresponding row of $\hat{Q}_{Y|D}$ (to preserve correctness of the decomposition), and then normalize the columns of $\hat{Q}_{Y|D}$. Some NMF algorithms only output solutions satisfying the anchor word property [4, 44, 30]; we found the strict requirement of an exact anchor word solution to lead to low noise tolerance. We therefore use the scikit-learn implementation of standard NMF [15, 69, 59]. We instantiate the domain discriminator as a ResNet-18 and pre-seed its backbone with SCAN [72] pretrained weights or [72] contrastive pretext weights; we denote these models DDFA (SI) and DDFA (SPI), respectively. We predict class labels with Algorithm 2. Using the Hungarian algorithm, implemented in [16, 74], we compute the highest true accuracy among any permutation of these labels (denoted "Test acc").
With the same permutation, we reorder the rows of $\hat{Q}_{Y|D}$, then compute the average absolute difference between corresponding entries of $\hat{Q}_{Y|D}$ and $Q_{Y|D}$ (denoted "$Q_{Y|D}$ err").

Results. On CIFAR-10, we observe that DDFA alone is incapable of matching the performance of the highly competitive SCAN baseline; however, in suitably sparse problem settings (small $\alpha$), it comes substantially close, indicating good recovery of the true classes. Due to space constraints, we include CIFAR-10 results in App. E. DDFA (SI) combines SCAN's strong pretraining with domain-discrimination fine-tuning to outperform SCAN in the easiest, sparsest setting and in certain denser settings. On CIFAR-20, the SCAN baseline is much less competitive, so our DDFA (SI) dominates baseline SCAN in all settings except the densest (Table 1). These results demonstrate how adding domain information can dramatically boost unsupervised baselines. On FieldGuide-2, DDFA (SPI) outperforms the SCAN baselines in accuracy across all but the densest settings (Table 5); in sparser settings, the accuracy gap is 10-30%. On this dataset, SCAN performs only slightly above chance, perhaps reflecting a total misalignment of the learned class distinctions with the true species boundaries. We do not believe that SCAN is too weak to effectively detect image groupings on this data; instead, we acknowledge that the domain information available to DDFA (SPI) (and not to SCAN) is informative for ensuring recovery of the true class distinction between species, as opposed to the visually striking distinction between adult and immature life stages. We do note that, oddly, SCAN tends to better recover the $Q_{Y|D}$ matrix in harder settings. On FieldGuide-28 (Table 6), DDFA outperforms the SCAN baseline in accuracy at all evaluated sparsity levels, with the largest observed accuracy differences above 30-40%.

7 Conclusion

Our theoretical results demonstrate that under LLS, we can leverage shifts among previously seen domains to recover labels in a purely unsupervised manner, and our practical instantiation of the DDFA framework demonstrates both (i) the practical efficacy of our approach and (ii) that generic unsupervised methods can play a key role both in clustering discriminator outputs and in providing weights for initializing the discriminator. We believe our work is just the first step in a new direction for leveraging structural assumptions together with distribution shift to perform unsupervised learning.

7.1 Future Work and Limitations

Assumptions. Our approach is limited by the set of assumptions needed (label shift, at least as many data domains as true latent classes, a known true number of classes $k$, and the other assumptions established in A.1-A.4). Future work should aim to relax these assumptions.

Theory. This work does not include finite-sample bounds for the DDFA algorithm. In addition, we do not have a formal guarantee that the clustering heuristic in the Discretize step of DDFA will retrieve pure anchor sub-domain clusters under potentially noisy black-box prediction of $q(d|x)$. This problem is complicated by the difficulty of reasoning about the noise that may be produced by a neural network or other complex nonlinear model (acting as the black-box domain discriminator), and by the lack of concrete guarantees that K-means will recover the anchor subdomains among its recovered clusters, especially when classes overlap.
DDFA Framework. Within the LLS setup, several components of the DDFA framework warrant further investigation: (i) the deep domain discriminator can be enhanced in myriad ways; (ii) for clustering discriminator outputs, we might develop methods specially tailored to our setting to replace the current generic clustering heuristic; (iii) clustering might be replaced altogether with geometrically informed methods that directly identify the corners of the polytope; (iv) the theory of LLS can be extended beyond identification to provide statistical results that might hold when $q(d|x)$ can only be noisily estimated, and when only finite samples are available for the induced topic modeling problem; (v) when the number of true classes $k$ is unknown, we may develop approaches to estimate this $k$.

Semi-synthetic Experiments. Semi-synthetic experiments present an ideal environment for evaluation under the precise label shift assumption. Evaluating on datasets in which the separation into domains is organic and the label shift is inherent is an interesting direction for future work.

Acknowledgements

SG acknowledges the Amazon Graduate Fellowship and the JP Morgan PhD Fellowship for their support. ZL acknowledges Amazon AI, Salesforce Research, Facebook, UPMC, Abridge, the PwC Center, the Block Center, the Center for Machine Learning and Health, and the CMU Software Engineering Institute (SEI) via Department of Defense contract FA8702-15-D-0002, for their generous support of ACMI Lab's research on machine learning under distribution shift.

References

[1] Amr Alexandari, Anshul Kundaje, and Avanti Shrikumar. Adapting to label shift with bias-corrected calibration. In International Conference on Machine Learning (ICML), 2019.
[2] Eric Arazo, Diego Ortego, Paul Albert, Noel O'Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning, pages 312-321. PMLR, 2019.
[3] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
[4] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization provably. In Symposium on Theory of Computing (STOC), 2012. doi: 10.1145/2213977.2213994. URL https://doi.org/10.1145/2213977.2213994.
[5] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models - going beyond SVD. In Foundations of Computer Science (FOCS). IEEE, 2012.
[6] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In International Conference on Learning Representations (ICLR), 2019.
[7] Jessa Bekker and Jesse Davis. Estimating the class prior in positive and unlabeled data through decision tree induction. In Association for the Advancement of Artificial Intelligence (AAAI), volume 32, 2018. URL https://ojs.aaai.org/index.php/AAAI/article/view/11715.
[8] Jessa Bekker and Jesse Davis. Learning from positive and unlabeled data: a survey. Machine Learning, 109(4):719-760, April 2020. doi: 10.1007/s10994-020-05877-5. URL https://doi.org/10.1007%2Fs10994-020-05877-5.
[9] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.
[10] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pages 132-149, 2018.
[11] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pretraining of image features on non-curated data, 2019. URL https://www.cv-foundation. org/openaccess/content_iccv_2015/papers/Doersch_Unsupervised_Visual_ Representation_ICCV_2015_paper.pdf. [12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning (ICML), pages 1597 1607. PMLR, 2020. [13] Yingyi Chen, Xi Shen, Shell Xu Hu, and Johan AK Suykens. Boosting co-teaching with compression regularization for label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2688 2692, 2021. [14] Yinyin Chen, Shishuang He, Yun Yang, and Feng Liang. Learning topic models: Identifiability and finite-sample analysis. ar Xiv preprint ar Xiv:2110.04232, 2021. [15] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions, 92-A:708 721, 03 2009. doi: 10.1587/transfun. E92.A.708. [16] David F Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679 1696, 2016. [17] Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, and Amos J. Storkey. Cinic-10 is not imagenet or cifar-10, 2018. URL https://arxiv.org/abs/1810.03505. [18] Thiago de Paulo Faleiros and Alneu de Andrade Lopes. On the equivalence between algorithms for non-negative matrix factorization and latent dirichlet allocation. In European Symposium on Artificial Neural Networks (ESANN), 2016. [19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Ieee, 2009. [20] Chris Ding, Tao Li, and Wei Peng. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis, 52(8): 3913 3927, 2008. [21] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In International Conference on Computer Vision (ICCV), 2015. URL https://www.cv-foundation.org/openaccess/content_iccv_2015/ papers/Doersch_Unsupervised_Visual_Representation_ICCV_2015_paper.pdf. [22] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? Advances in Neural Information Processing Systems (Neur IPS), 16, 2003. [23] Marthinus C. du Plessis, Gang Niu, and Masashi Sugiyama. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4):463 492, nov 2016. doi: 10.1007/ s10994-016-5604-6. URL https://doi.org/10.1007%2Fs10994-016-5604-6. [24] Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 213 220, 2008. [25] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary Lipton. A unified view of label shift estimation. In Advances in Neural Information Processing Systems (Neur IPS), 2020. URL https://arxiv.org/abs/2003.07554. [26] Saurabh Garg, Yifan Wu, Alex Smola, Sivaraman Balakrishnan, and Zachary Lipton. Mixture proportion estimation and PU learning: A modern approach. In Advances in Neural Information Processing Systems (Neur IPS), 2021. URL https://arxiv.org/abs/2111.00980. 
[27] Saurabh Garg, Sivaraman Balakrishnan, and Zachary Lipton. Domain adaptation under open set label shift. In Advances in Neural Information Processing Systems (Neur IPS), 2022. [28] Eric Gaussier and Cyril Goutte. Relation between plsa and nmf and implications. In Conference on Research and Development in Information Retrieval (SIGIR), 2005. [29] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations, 2018. URL https://arxiv.org/abs/1803.07728. [30] Nicolas Gillis and Stephen A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):698 714, apr 2014. doi: 10.1109/tpami.2013.226. URL https://doi.org/10.1137% 2F130946782. [31] Mark Girolami and Ata Kabán. On an equivalence between plsi and lda. In Conference on Research and Development in Information Retrieval (SIGIR), 2003. [32] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In International Conference on Learning Representations, 2017. [33] Jiaxian Guo, Mingming Gong, Tongliang Liu, Kun Zhang, and Dacheng Tao. Ltf: A label transformation framework for correcting label shift. In International Conference on Machine Learning, pages 3843 3853. PMLR, 2020. [34] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in Neural Information Processing Systems (Neur IPS), 31, 2018. [35] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [36] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729 9738, 2020. [37] Thomas Hofmann. Probabilistic latent semantic indexing. In Conference on Research and Development in Information Retrieval (SIGIR), 1999. [38] Kejun Huang, Xiao Fu, and Nikolaos D Sidiropoulos. Anchor-free correlated topic modeling: Identifiability and algorithm. In Advances in Neural Information Processing Systems (Neur IPS), 2016. [39] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411 430, 2000. [40] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411 430, 2000. [41] Dmitry Ivanov. DEDPUL: Difference-of-estimated-densities-based positive-unlabeled learning. ar Xiv preprint ar Xiv:1902.06965, 2019. [42] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535 547, 2019. [43] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, Citeseer, 2009. [44] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In International Conference on Machine Learning, pages 231 239. PMLR, 2013. [45] Junnan Li, Richard Socher, and Steven C.H. Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations, 2020. [46] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 
Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning (ICML). PMLR, 2018. [47] Sheng Liu, Jonathan Niles-Weed, Narges Razavian, and Carlos Fernandez-Granda. Earlylearning regularization prevents memorization of noisy labels. Advances in Neural Information Processing Systems (Neur IPS), 33:20331 20342, 2020. [48] Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You. Robust training under label noise by overparameterization. ar Xiv preprint ar Xiv:2202.14026, 2022. [49] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(3):447 461, mar 2016. doi: 10.1109/tpami.2015.2456899. URL https://doi.org/10.1109%2Ftpami.2015. 2456899. [50] James Mac Queen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pages 281 297. Oakland, CA, USA, 1967. [51] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning (ICML). PMLR, 2013. [52] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. Advances in Neural Information Processing Systems (Neur IPS), 26, 2013. [53] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69 84. Springer, 2016. [54] Pentti Paatero and Unto Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111 126, 1994. [55] Daniel Paleka and Amartya Sanyal. A law of adversarial risk, interpolation, and label noise. ar Xiv preprint ar Xiv:2207.03933, 2022. [56] Christos H Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2): 217 235, 2000. [57] Sungwon Park, Sungwon Han, Sundong Kim, Danu Kim, Sungkyu Park, Seunghoon Hong, and Meeyoung Cha. Improving unsupervised image clustering with robust learning. Co RR, abs/2012.11150, 2020. URL https://arxiv.org/abs/2012.11150. [58] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1944 1952, 2017. [59] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12, 2011. [60] Kedar S Prabhudesai, Boyla O Mainsah, Leslie M Collins, and Chandra S Throckmorton. Augmented latent dirichlet allocation (lda) topic model with gaussian mixture topics. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2451 2455. IEEE, 2018. [61] Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. 
In Yoshua Bengio and Yann Le Cun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Workshop Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6596. [62] Douglas A Reynolds. Gaussian mixture models. Encyclopedia of biometrics, 741(659-663), 2009. [63] Alexander Ritchie, Robert A Vandermeulen, and Clayton Scott. Consistent estimation of identifiable nonparametric mixture models from grouped observations. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 11676 11686. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ 866d90e0921ac7b024b47d672445a086-Paper.pdf. [64] Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 2002. [65] Clayton Scott. A Rate of Convergence for Mixture Proportion Estimation, with Application to Learning from Noisy Labels. In Artificial Intelligence and Statistics (AISTATS), 2015. URL https://proceedings.mlr.press/v38/scott15.html. [66] D Seung and L Lee. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems (Neur IPS), 2001. [67] Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pages 5739 5748. PMLR, 2019. [68] Amos Storkey. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 30:3 28, 2009. [69] Vincent YF Tan and Cédric Févotte. Automatic relevance determination in nonnegative matrix factorization with the/spl beta/-divergence. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(7), 2012. [70] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5552 5560, 2018. [71] Dongping Tian. Research on plsa model based semantic image analysis: A systematic review. J. Inf. Hiding Multim. Signal Process., 9(5):1099 1113, 2018. [72] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels, 2020. URL https://arxiv. org/abs/2005.12320. [73] Robert A. Vandermeulen and Clayton D. Scott. An operator theoretic approach to nonparametric mixture models. The Annals of Statistics, 47(5):2704 2733, 2019. doi: 10.1214/18-AOS1762. URL https://doi.org/10.1214/18-AOS1762. [74] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. doi: 10.1038/s41592-019-0686-2. [75] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. 
Chemometrics and intelligent laboratory systems, 2(1-3):37 52, 1987. [76] Jianxin Zhang, Yutong Wang, and Clayton Scott. Learning from label proportions by learning with label noise. ar Xiv preprint ar Xiv:2203.02496, 2022. [77] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning (ICML). PMLR, 2013. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] (c) Did you discuss any potential negative societal impacts of your work? [N/A] We believe that this work, which proposes a novel unsupervised learning problem does not present a significant societal concern. While this could potentially guide practitioners to improve classification, we do not believe that it will fundamentally impact how machine learning is used in a way that could conceivably be socially salient. (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Sec. 3 (b) Did you include complete proofs of all theoretical results? [Yes] See Sec. 4 and appendices 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We include all the necessary details to replicate our experiments in appendices. Code URL is given in the paper. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Yes, we describe crucial details in Sec. 6 and defer precise details to appendices. (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We include error bars for all main paper experiments and ablations. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Refer to experimental setup in App. D. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] Refer to experimental setup in App. D. (b) Did you mention the license of the assets? [Yes] Refer to experimental setup in App. D. (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] Refer to experimental setup in App. D. (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]