Published in Transactions on Machine Learning Research (06/2024)

Bayesian Quantification with Black-Box Estimators

Albert Ziegler (albert@xbow.com), XBOW, Head of AI, Uppsala, Sweden
Paweł Czyż (pawelpiotr.czyz@ai.ethz.ch), ETH AI Center and Department of Biosystems Science and Engineering, ETH Zürich, Zürich, Switzerland

Reviewed on OpenReview: https://openreview.net/forum?id=Ft4kHrOawZ

Understanding how different classes are distributed in an unlabeled data set is important for the calibration of probabilistic classifiers and uncertainty quantification. Methods like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary and potentially biased black-box classifier trained on a different data set to estimate the class distribution on the current data set, and yield asymptotic guarantees under weak assumptions. We demonstrate that these algorithms are closely related to inference in a particular probabilistic graphical model approximating the assumed ground-truth generative process, and we propose a Bayesian estimator. Then, we discuss an efficient Markov chain Monte Carlo sampling scheme for the introduced model and show an asymptotic consistency guarantee in the large-data limit. We compare the introduced model against the established point estimators in a variety of scenarios and show that it is competitive with, and in some cases superior to, the non-Bayesian alternatives.

1 Introduction

Consider a medical test predicting illness (classification label Y), such as influenza, based on symptoms (features X). This often can be modeled as an anti-causal problem¹ (Schölkopf et al., 2012), where Y causally affects X. Under the usual i.i.d. assumption, one can approximate the probabilities P(Y | X) using a large enough training data set.
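To make the anti-causal setup concrete, here is a toy discrete sketch (all numbers hypothetical): Y causes X through a conditional table P(X | Y), and the classifier's target P(Y | X) follows from Bayes' theorem.

```python
import numpy as np

# Toy discrete version of the anti-causal setup (all numbers hypothetical):
# Y causes X, and column y of K holds the mechanism P(X = x | Y = y).
K = np.array([[0.7, 0.2],
              [0.2, 0.3],
              [0.1, 0.5]])
pi_train = np.array([0.6, 0.4])  # class prior P_train(Y)

# Bayes' theorem: P(Y | X = x) is proportional to P(X = x | Y) P(Y).
joint = K * pi_train                                   # P(X = x, Y = y)
posterior = joint / joint.sum(axis=1, keepdims=True)   # P_train(Y | X = x)
```

Note that the posterior depends on the class prior: replacing `pi_train` with a shifted prior changes `posterior` even though `K` is fixed, which is exactly the situation discussed next.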
However, the performance on real-world data may be lower than expected due to data shift: the issue that real-world data come from a different probability distribution than the training data. For example, well-calibrated classifiers trained during early stages of the COVID-19 pandemic will underestimate the incidence of the illness at the time of a surge in infections.

The paradigmatic case of data shift is prior probability shift, where the context (e.g., training and test phase) influences the distribution of the target label Y, although the generative mechanism generating X from Y is left unchanged. In other words, Ptrain(X | Y) = Ptest(X | Y), although Ptrain(Y) may differ from Ptest(Y). If Ptest(Y) is known, then Ptest(Y | X) can be calculated by rescaling Ptrain(Y | X) according to Bayes' theorem (see Saerens et al. (2001, Sec. 2.2) or Schölkopf et al. (2012, Sec. 3.2)). However, Ptest(Y) is usually unknown and needs to be estimated having access only to a finite sample from the distribution Ptest(X). This task is known as quantification (González et al., 2017; Forman, 2008).

¹ While influenza causes high fever, in many medical problems the causal relationships are much more complex (Castro et al., 2020).

Figure 1: Left: High-dimensional model Mtrue. Dashed arrows are used to create low-dimensional representations C_i and C′_j using a given black-box function f. Filled nodes represent observed random variables. The top row represents the labeled data set and the bottom row represents the unlabeled data set. Middle: Tractable approximation Mapprox (Sec. 3). Right: Model Msmall enabling fast inference when f is a black-box classifier or a clustering algorithm (Sec. 3.2).

Although quantification found applications in adjusting classifier predictions, it is an important problem on its own. For example, imagine an inaccurate but cheap COVID-19 test, which can be taken by a
significant fraction of the population on a weekly basis. While this test may not be sufficient to determine whether a particular person is contagious, the estimate of the true number of positive cases could be used by epidemiologists to monitor the reproduction number and by the health authorities to inform public policy.²

We advocate treating the quantification problem using Bayesian modeling, which provides uncertainty around the Ptest(Y) estimate. This uncertainty can be used directly if the distribution on the whole population is of interest, or it can be used to calibrate a probabilistic classifier to yield a more informed estimate for the label of a particular observation. A Bayesian approach was already proposed by Storkey (2009, Sec. 6). However, that proposal relies on a generative model P(X | Y), which can be misspecified or computationally intractable in high-dimensional settings. Hence, quantification is usually approached either via the expectation-maximization (EM) algorithm (Saerens et al., 2001) or via a family of closely related algorithms known as invariant ratio estimators (Vaz et al., 2019), black-box shift estimators (Lipton et al., 2018), or adjusted classify and count (Forman, 2008), which replace the generative model P(X | Y) with a (potentially biased) classifier. Tasche (2017), Lipton et al. (2018), and Vaz et al. (2019) proved that these algorithms are asymptotically consistent (they recover Ptest(Y) in the limit of infinite data) under weak assumptions and derived asymptotic bounds on the related error.

Our contributions are:
1. We show a connection between the quantification algorithms employing a black-box (and potentially biased) classifier and Bayesian inference in a probabilistic graphical model approximating the ground-truth generative process.
2. We present a tractable Bayesian approach, which is well-suited for low-data problems.
Established alternatives provide asymptotic estimates on error bounds, but may be far off for small samples (to the point that some of the estimates of Ptest(Y) may be negative). The Bayesian approach explicitly quantifies the uncertainty and does not suffer from the negative-values problem. Moreover, it is possible to incorporate expert's knowledge via the choice of the prior distribution.
3. We prove that maximum a posteriori inference in the considered model is asymptotically consistent under weak assumptions.

² Note, however, that testing strategies may be adapted to the outbreaks, which in turn induces correlations between observed data, violating the usual assumption that the data are exchangeable. We discuss contraindications in Sec. 5.

2 The quantification problem and existing solutions

Consider a classification problem with Y = {1, 2, ..., L} labels and observed features coming from a measurable space X. A given object is then represented by two random variables (r.v.): a Y-valued r.v. representing the label and an X-valued r.v. representing the measured features. We consider an anti-causal problem in which there exists a probabilistic mechanism P(X | Y), responsible for generating the features from the label. We focus on two populations sharing the same generative mechanism, assuming that there exist probability distributions Ptrain(X, Y) and Ptest(X, Y) over X × Y such that Ptest(X | Y = y) = Ptrain(X | Y = y) for every y ∈ Y. In the literature this assumption is referred to as prior probability shift (Storkey, 2009). We will write K_y for the conditional distribution P(X | Y = y), which is the generative mechanism shared by both populations. The quantification problem (González et al., 2017) asks whether, given finite i.i.d. samples from the distributions Ptrain(X, Y) and Ptest(X), it is possible to determine the distribution Ptest(Y).
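In the idealized case of a finite discrete X with known mechanisms K_y, this question reduces to linear algebra: the observed mixture determines Ptest(Y) whenever the K_y are linearly independent. A toy NumPy sketch (hypothetical numbers):

```python
import numpy as np

# Hypothetical finite case: |X| = 3 feature values, L = 2 classes.
# Column y of K is the shared generative mechanism K_y = P(X | Y = y).
K = np.array([[0.7, 0.2],
              [0.2, 0.3],
              [0.1, 0.5]])
pi_test = np.array([0.4, 0.6])   # the unknown target of quantification
p_x = K @ pi_test                # the observable mixture P_test(X)

# With linearly independent columns (a left inverse of K exists),
# the mixture determines the prevalence vector uniquely.
pi_recovered, *_ = np.linalg.lstsq(K, p_x, rcond=None)
```

The difficulty discussed below is that neither `p_x` nor `K` is known exactly; both must be estimated from finite samples.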
In principle, if the data are abundant, one can use samples from Ptrain(X, Y) to determine the conditional distributions K_y and then find all probability vectors Ptest(Y) which are compatible with the distribution of the features Ptest(X), written as a mixture distribution of the generative mechanisms K_y:

Ptest(X) = Σ_{y∈Y} Ptest(Y = y) K_y.   (1)

The uniqueness of the vector Ptest(Y) follows under the strict linear independence assumption on the measures K_y (Garg et al., 2020). We review the notion of strict linear independence in Appendix A, but for a finite discrete space X it reduces to the linear independence of the probability vectors K_y, which allows constructing the left inverse of the P(X | Y) matrix. In practice, however, it is not possible to fully reconstruct the distributions Ptest(X) and K_y from finite samples, and principled statistical approaches are needed.

To formalize the problem, consider the probabilistic graphical model Mtrue in Fig. 1: the probability vectors Ptrain(Y) and Ptest(Y) are modeled with r.v. π and π′ valued in the probability simplex

Δ^{L−1} = {y ∈ (0, 1)^L : y_1 + ⋯ + y_L = 1},

and the ground-truth generative processes K_y are modeled via parametric distributions K_y(·; θ) with a parameter vector θ. We observe N pairs of r.v. (X_i, Y_i) for i ∈ {1, ..., N} sampled independently according to the model

Y_i | π ∼ Categorical(L, π),   X_i | Y_i, θ ∼ K_{Y_i}(·; θ).

Additionally, we observe N′ r.v. X′_j for j ∈ {1, ..., N′} sampled independently from the mixture distribution

X′_j | π′, θ ∼ Σ_{y=1}^{L} π′_y K_y(·; θ),

or, if latent variables Y′_j valued in Y are introduced,

Y′_j | π′ ∼ Categorical(L, π′),   X′_j | Y′_j, θ ∼ K_{Y′_j}(·; θ).

2.1 Likelihood-based methods

Our work draws on two major groups of quantification methods, with the description of other approaches deferred to Appendix D. The first group proceeds by considering a generative probabilistic model of the data. Peters & Coberly (1976) assume that each K_y(·; θ) is a multivariate normal distribution.
Then, they estimate θ using the labeled data {X_i, Y_i} and find the maximum likelihood solution for π′ by an iterative optimization algorithm. Storkey (2009) discusses approaching quantification problems within fully Bayesian estimation, which requires marginalization of θ, and notices that such marginalization may not generally be tractable for complex generative models K_y(·; θ). Moreover, a tractable generative model of high-dimensional data is likely to be misspecified, which may compromise Bayesian inference (Watson & Holmes, 2016; Lyddon et al., 2018).

Saerens et al. (2001) observe that specifying high-dimensional distributions K_y(·; θ) may be avoided if one instead has access to an oracle probabilistic classifier r: X → Δ^{L−1} such that r(x) = P(Y_i | X_i = x, π). Then, they show how to use a candidate value π′ to recalibrate r(x) and marginalize the latent variables Y′_j in the expectation-maximization (EM) manner, which iteratively optimizes π′, targeting the maximum likelihood estimate. In Appendix D.1 we give a detailed treatment of this algorithm, together with two simple extensions: when a Dirichlet prior is used for P(π′), EM targets the maximum a posteriori (MAP) estimate of the posterior distribution P(π′ | {X′_j}, r). Moreover, we describe a Gibbs sampler allowing one to draw from the posterior P(π′ | {X′_j}, r). However, this posterior is generally different from P(π′ | {X′_j}, {X_i, Y_i}), which requires marginalization over r, and the performance of the EM algorithm depends on access to the oracle classifier r, which has to be well-calibrated (Garg et al., 2020). As modern classification methods, such as neural networks, are often miscalibrated (Guo et al., 2017), one has to leverage the labeled data set {X_i, Y_i} to improve calibration. Alexandari et al. (2020) introduce calibration methods which can yield state-of-the-art results using the expectation-maximization algorithm.
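The EM recalibration loop is short enough to sketch. The following is a simplified NumPy version in the spirit of Saerens et al. (2001); the function name is ours, and it assumes the classifier outputs are well-calibrated, which is exactly the caveat discussed above.

```python
import numpy as np

def em_prevalence(r_probs, pi_train, n_iter=200):
    """Simplified EM loop in the spirit of Saerens et al. (2001).

    r_probs: (N', L) array of calibrated classifier outputs P_train(Y | X'_j);
    the loop re-weights them under a candidate prior pi' and re-estimates pi'.
    """
    pi = np.asarray(pi_train, dtype=float).copy()
    for _ in range(n_iter):
        # E-step: recalibrate the posteriors to the candidate prior pi'.
        w = r_probs * (pi / pi_train)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: the new prevalence is the mean responsibility.
        pi = w.mean(axis=0)
    return pi
```

Each iteration touches all N′ rows of `r_probs`, which illustrates the O(N′) per-iteration cost discussed below.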
Although both approaches are based on a sound likelihood framework, Peters & Coberly (1976) require learning high-dimensional generative models K_y(X; θ) = P(X | Y = y, θ) and Saerens et al. (2001) assume access to a well-calibrated oracle classifier P(Y_i | X_i, π). Moreover, each iteration of all mentioned approaches requires operations involving all N′ variables X′_j. This may limit the scalability of either algorithm to large data sets.

2.2 Methods involving an auxiliary black-box classifier

The second group of approaches is based on a modification of Eq. 1 and assumes access to a given auxiliary black-box mapping: consider a given measurable space C and a measurable mapping³ f: X → C. For example, f can be a pretrained feature extractor (such as a large language model), a clustering algorithm, or a generic classifier, trained on a large data set with possibly a different set of categories. Then, one can define new observed r.v. C_i = f(X_i) and C′_j = f(X′_j), which in Fig. 1 corresponds to the part of the diagram with dashed arrows. Note that the new variables act only as a summary statistic and do not increase the amount of information available. Namely, given {(X_i, Y_i)} and {X′_j}, the r.v. π′ is conditionally independent of {C_i} and {C′_j}, i.e.:

π′ ⫫ {C_i}, {C′_j} | {(X_i, Y_i)}, {X′_j}.

However, the prior probability shift assumption implies that the distributions P(C_i | Y_i = y) = P(C′_j | Y′_j = y) are equal for an arbitrary label y ∈ Y and indices i, j. In particular, Eq. 1 can be used with the original features X replaced with the newly introduced representations C = f(X). As they are of lower dimension than X, it may be easier to approximate the required probabilities with the available data samples. For example, Vaz et al. (2019) propose invariant ratio estimators, which generalize the earlier approaches of adjusted classify and count (Forman, 2008; Tasche, 2017) and its variant introduced by Bella et al. (2010).
Namely, for a given mapping f: X → R^{L−1}, one constructs

f̂ = (1/N′) Σ_j C′_j,   F̂_{:,y} = (1/|S_y|) Σ_{i∈S_y} C_i,   where S_y = {i ∈ {1, ..., N} : Y_i = y},

and solves the set of equations given by f̂ = F̂π′ and π′_1 + ⋯ + π′_L = 1. In Appendix D.2 we review the closely related algorithms employing black-box classifiers and based on matrix inversion (solving a set of linear equations), including the popular algorithm of Lipton et al. (2018).

Estimators employing black-box classifiers offer four advantages over likelihood-based methods. First, as the auxiliary mapping f can produce low-dimensional representations, estimating the probabilities appearing in Eq. 1 may be more accurate. Second, Peters & Coberly (1976) require training a potentially high-dimensional generative model and Saerens et al. (2001) require a well-calibrated oracle probabilistic classifier, which may be hard to obtain in practice. Third, each optimization step in a likelihood-based method requires O(N′) operations. The black-box mapping f has to be applied only once to each X′_j to construct the summary statistic, which is then used for solving a linear set of equations (cf. Eq. 1). Finally, even when P(X | Y) is not invariant (i.e., the prior probability shift assumption does not hold), for an appropriate dimension reduction method f it may hold that the distribution of the low-dimensional representations, P(C | Y), is invariant (Arjovsky et al., 2019). Lipton et al. (2018) call invariance of P(C | Y) the weak prior probability shift assumption.

³ Although for the clarity of the exposition we use notation corresponding to a measurable function f, the results hold mutatis mutandis for an arbitrary Markov kernel (Klenke, 2014, Sec. 8.3), so that f does not need to be deterministic.

However, at the same time, methods employing black-box dimension reduction methods f have three undesirable properties.
First, dimension reduction methods may incur a loss of information (Fedorov et al., 2009; Harrell, 2015, Sec. 1.3). In particular, even if the ground-truth distributions K_y = P(X | Y = y) are strictly linearly independent, the pushforward distributions P(C | Y = y) do not have to be. Second, solving Eq. 1 requires approximating probability distributions based on the laws of large numbers: although these methods have desirable asymptotic behaviour, likelihood-based methods explicitly work with a given finite sample. Finally, solving a linear set of equations is not numerically stable when P(C | Y) has a large condition number. In the next section we show how to solve the last two issues within our proposed Bayesian framework.

3 Bayesian quantification with black-box shift estimators

We work in the setting of Fig. 1 with N labeled examples (X_1, Y_1), ..., (X_N, Y_N) and N′ unlabeled examples X′_1, ..., X′_{N′} obtained under the prior probability shift assumption. Additionally, we assume that we work with a given dimension reduction mapping f: X → C. A fully Bayesian treatment (Storkey, 2009) relies on an assumed parametric generative mechanism K_y(X; θ) = P(X | Y = y, θ) and marginalizes over all possible values of the parameter θ to obtain the values of the latent variables π and π′. From the graphical structure in Fig. 1 we note that the posterior factorizes as

P(π, π′ | {(X_i, Y_i)}, {X′_j}) = P(π | {Y_i}) P(π′ | {X′_j}, {(X_i, Y_i)}),

and

P(π′ | {X′_j}, {(X_i, Y_i)}) ∝ P(π′) ∫ ∏_i K_{Y_i}(X_i; θ) ∏_j Σ_y π′_y K_y(X′_j; θ) dP(θ).   (2)

The posterior P(π | {Y_i}) is analytically tractable when a Dirichlet prior P(π) is used, so the difficulty in quantification lies in finding P(π′ | {X′_j}, {(X_i, Y_i)}). If θ is of moderate dimension, this distribution can be approximated using Markov chain Monte Carlo (MCMC) algorithms (Betancourt, 2017) by jointly sampling π′ and θ from the posterior P(π′, θ | {(X_i, Y_i)}, {X′_j}) and retaining only the π′ component.
However, in complex problems involving high-dimensional θ variables and large sample sizes N and N′, MCMC methods become computationally expensive, which may limit their applicability (Betancourt, 2015; Bardenet et al., 2017; Izmailov et al., 2021). Moreover, if the parametric kernels K_y(·; θ) are misspecified, which is arguably often the case in high-dimensional problems, the resulting inference may be compromised (Watson & Holmes, 2016; Lyddon et al., 2018); we investigate this issue in Sec. 4.3.

Both tractability and robustness to model misspecification can be simultaneously addressed by employing the provided black-box feature extractor f to replace X_i with C_i and X′_j with C′_j: Lewis et al. (2021) propose to improve robustness to model misspecification in regression models by conditioning on an insufficient summary statistic, rather than the original data. In our case, we consider the conditional distribution

P(π′ | {C′_j}, {(C_i, Y_i)}) ∝ P(π′) ∫ ∏_i K_{Y_i}(C_i; φ) ∏_j Σ_y π′_y K_y(C′_j; φ) dP(φ),   (3)

where K_y(·; φ) are distributions on the low-dimensional space C, rather than on the high-dimensional space X, parameterized by a vector φ. Although it is possible to take φ = θ and define K(·; φ) to be the pushforward measure of K(·; θ) by the dimension reduction method f, we generally hope that a low-dimensional distribution K(·; φ) may require fewer parameters, and φ will be of a much lower dimension than θ, making the integral in Eq. 3 more tractable than Eq. 2.

Apart from improved tractability, conditioning on a summary statistic may improve robustness due to the easier specification of low-dimensional distributions K(·; φ). Finally, even if the prior probability shift assumption does not hold, i.e., P(X | Y) is not invariant, the distribution of the low-dimensional representations P(C | Y) may be invariant (Arjovsky et al., 2019), which in the notation of Lipton et al. (2018) corresponds to the weak prior probability shift assumption.
On the other hand, conditioning on an insufficient statistic loses information: the trivial approximation C = {1} and f(x) = 1 forgets all available information and results in the posterior being the same as the prior, P(π′ | {C_i, Y_i}, {C′_j}) = P(π′), even in the limit of infinite data. Although the outlined methodology of approximating the intractable inference with a simpler model built on a given black-box dimension reduction method f is general, below we analyse the simplest possible model, where C = {1, 2, ..., K} and f is given by a black-box classifier, or a clustering algorithm, trained on a potentially very different data set.

3.1 The discrete model

Consider C = {1, 2, ..., K} and a given black-box function f: X → C. For example, f can be a miscalibrated classification algorithm trained on an entirely different data set (in particular, it is possible that K ≠ L) or a function assigning points to predefined clusters. If K < L, we are not able to identify Ptest(Y) based on the outputs of f. However, as we find the full Bayesian posterior, rather than providing a point estimate based on matrix inversion, the posterior distribution on π′ may shrink along specific dimensions, providing an accurate estimate of the class prevalence for several classes. On the other hand, if K ≥ L and there is a strong enough correlation between the outputs of f and the ground-truth labels, the guarantees on methods employing black-box shift classifiers and matrix inversion (Tasche, 2017; Lipton et al., 2018; Vaz et al., 2019) ensure identifiability of the prevalence vector π′ in the large-data limit, as we will see in Theorem 3.1. In this case, the model K_y(·; φ) is particularly simple, independently of the true data-generating process K_y: as each of the distributions K_y(·; φ) is supported on a finite set C, they have to be categorical distributions.
Namely, φ = (φ_yk) is a matrix modeling the ground-truth probability table P(C = k | Y = y), and the model will not be misspecified provided that the weak prior probability shift assumption holds and that the prior on φ, π, and π′ is positive on the simplices Δ^{K−1} (for each φ_{y:}) and Δ^{L−1} (for π and π′). The approximate model Mapprox takes the form

π, π′, φ ∼ P(π, π′, φ),
Y_i | π ∼ Categorical(L, π),   C_i | Y_i, φ ∼ Categorical(K, φ_{Y_i:}),
Y′_j | π′ ∼ Categorical(L, π′),   C′_j | Y′_j, φ ∼ Categorical(K, φ_{Y′_j:}).

We do not put specific requirements on the prior P(π, π′, φ) other than it being positive on the probability simplices: for example, the Dirichlet distribution or the logistic normal distribution can be used. If precise problem-specific information is not available, we recommend using weakly informative prior distributions (Gelman et al., 2013, Sec. 2.9), such as the uniform prior over each simplex. However, we note that if Dirichlet priors are used, the model conceptually resembles a very low-dimensional variant of latent Dirichlet allocation (Pritchard et al., 2000; Blei et al., 2003), with the observed r.v. Y_i and C_i providing information on the φ matrix. In Section 3.2 we will show how to construct a scalable sufficient statistic and perform efficient inference using Hamiltonian Markov chain Monte Carlo methods (Betancourt, 2017).

Although for K < L the model is not identifiable, for K ≥ L it shares asymptotic properties similar to the alternatives based on matrix inversion. In Appendix C we prove the following result:

Theorem 3.1. Assume that:
1. The weak prior probability shift assumption holds, i.e., Ptrain(C | Y) = Ptest(C | Y) is invariant between the populations.
2. The ground-truth matrix φ* = (P(C = k | Y = y))_{yk} is of rank L and all its entries are strictly positive.
3. The ground-truth prevalence vectors π* = Ptrain(Y) and π′* = Ptest(Y) have only strictly positive entries.
4.
The prior P(π, π′, φ) is continuous and strictly positive on the whole space Δ^{L−1} × Δ^{L−1} × (Δ^{K−1})^L.

Then, for every δ > 0 and ε > 0, there exist N and N′ large enough such that with probability at least 1 − δ the maximum a posteriori estimate (π̂, π̂′, φ̂) is in the ε-neighborhood of the true parameter values (π*, π′*, φ*).

Compared to the traditional approaches, we do not explicitly invert the matrix P(C | Y) (modeled with φ), as any degeneracy is simply reflected in the posterior, showing that we did not learn anything new about the prevalence of some classes. However, if the full-rank condition holds, the maximum a posteriori estimate asymptotically recovers the true parameters. This result is similar to the standard results on methods employing matrix inversion, which typically assume conditions 1–3. As an additional condition, we require that the prior is positive on the whole space, which ensures that the ground-truth parameters lie in a region with positive density. This result is conceptually similar to the classical Bernstein–von Mises theorem linking Bayesian and frequentist inference in the large-data limit, and in particular depends on the assumption that the model is not misspecified. However, we stress that in Bayesian statistics one should use the whole posterior distribution, rather than relying on a single point estimate. In Appendix C.1 we provide a further discussion of this point.

3.2 Fast inference in the discrete model

One advantage of methods employing black-box classifiers and matrix inversion over likelihood-based methods is that they require O(N + N′) computation time to preprocess the data, after which the estimate is generated by matrix inversion, which is polynomial in K and L. Likelihood-based methods, however, generally require at least O(N′) operations per likelihood evaluation. As such, these methods may not scale to large data sets or to settings where extensive resampling methods, such as the bootstrap (Efron, 1979; Tasche, 2019), are required.
For the discrete Bayesian model, however, it is possible to preprocess the data in O(N + N′) time and then evaluate the likelihood and its gradient in time polynomial in K and L, without any further dependence on N or N′. In this section we show how to construct a sufficient statistic for π, π′, and φ whose size is independent of N and N′ and which allows one to perform efficient inference with existing sampling methods.

Define a K-tuple (N′_k)_{k∈C} of r.v. summarizing the unlabeled data set by

N′_k = |{j ∈ {1, ..., N′} : C′_j = k}|,

which can be constructed in O(K) memory and O(N′) time. Then, for each y ∈ Y, we define a K-tuple of r.v. (F_{yk})_{k∈C}, such that

F_{yk} = |{i ∈ {1, ..., N} : Y_i = y and C_i = k}|,

which requires O(LK) memory and O(N) time. Finally, we define an L-tuple of r.v. (N_y)_{y∈Y} by N_y = F_{y1} + ⋯ + F_{yK}. In Appendix B we prove that the likelihood P({Y_i, C_i}, {C′_j} | π, π′, φ) is proportional to the likelihood P((N_y), (N′_k), (F_{yk}) | π, π′, φ) in a smaller model, Msmall:

(N_y) | π ∼ Multinomial(N, π),
(F_{y:}) | N_y, φ ∼ Multinomial(N_y, φ_{y:}),
(N′_k) | π′, φ ∼ Multinomial(N′, φ^T π′).

Hence, by the factorization theorem of Halmos & Savage (1949), we have constructed a sufficient statistic for the inference of π, π′, and φ whose size is independent of N and N′. In turn, we can use the likelihood of Msmall (rather than Mapprox) to sample π, π′, and φ from the posterior, allowing us to perform each likelihood evaluation in O(KL), rather than O(N + N′), time. Moreover, the gradient of the likelihood is available, so we can use any of the existing Hamiltonian Markov chain Monte Carlo algorithms (Betancourt, 2017; Hoffman & Gelman, 2014).

4 Experimental results

We evaluate the proposed method in four aspects. In Sec.
4.1 we analyze the benefits of using the Bayesian approach, rather than matrix inversion, in problems which are identifiable but where the matrix P(C | Y) has a large condition number, requiring a large number of samples to be estimated accurately. In Sec. 4.2 we compare the posterior mean with the existing point-prediction methods employing black-box classifiers using extensive simulated data sets. In Sec. 4.3 we investigate how the posterior in the approximate model, P(π′ | {Y_i, C_i}, {C′_j}), compares to the posterior P(π′ | {Y_i, X_i}, {X′_j}) with properly specified, as well as misspecified, generative models. Finally, in Sec. 4.4 we present an application of quantification methods to the problem of estimating cell type prevalence from single-cell RNA sequencing data.

Figure 2: Uncertainty of the prevalence estimates in the nearly non-identifiable model (panels: posterior samples; bootstrapped RIR, UIR, and BBS estimates for classes 1–3). Samples from the proposed Bayesian posterior quantify uncertainty better than bootstrapping point estimators based on matrix inversion.

In all settings, we assume no specific knowledge of the problem (which could be used for principled prior elicitation) and use uniform distributions as the priors for the π, π′, and φ_{y:} vectors.

4.1 Tighter estimation for hard-to-identify model

Bayesian inference does not require an explicit inversion of the P(C | Y) matrix. We therefore hypothesise that it may be preferable in cases where P(C | Y) has a large condition number.
Hence, in this section we consider a case with L = K = 3 and a given black-box classifier which can only weakly distinguish between classes 2 and 3, i.e., the ground-truth matrix P(C | Y) is given by

φ* = (φ*_yk) =
( 0.96  0.02  0.02 )
( 0.02  0.50  0.48 )
( 0.02  0.48  0.50 ).

Although the matrix is full-rank (and asymptotic identifiability for all methods employing black-box classifiers holds), having access to only a finite sample may limit the practical ability to accurately estimate the prevalence. We simulated a data set with N = N′ = 1000 data points and ground-truth prevalence vectors π* = (1/3, 1/3, 1/3) and π′* = (0.5, 0.35, 0.15).

In Fig. 2 we plot posterior samples from the proposed Bayesian model together with the bootstrapped (Efron, 1979) predictions of three methods employing black-box classifiers and performing explicit inversion: restricted and unrestricted invariant ratio estimators (RIR and UIR, respectively; Vaz et al. (2019)) and the black-box shift estimator of Lipton et al. (2018) (BBS). We used S = 100 bootstrap samples with the stratified bootstrapping procedure introduced for quantification problems by Tasche (2019); in Appendix E we additionally reproduce this experiment varying the sampled data sets. We see that the Bayesian approach identifies the component π′_1, leaving large uncertainty on the entries π′_2 and π′_3. On the other hand, all bootstrapped methods struggle with estimating any of the components. Moreover, UIR and BBS often result in estimates with negative entries. As we further illustrate in Appendix E, this behaviour is typical for low sample sizes, and bootstrapped predictions become stable for N = N′ = 10⁴ samples. However, the proposed Bayesian approach does not suffer from these low-data issues, appropriately quantifying uncertainty.
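For reference, an unrestricted invariant-ratio-style point estimate of the kind bootstrapped above can be sketched in a few lines of NumPy (a simplification of the estimators of Vaz et al. (2019); the function name is ours):

```python
import numpy as np

def ir_estimate(C_labeled, y_labeled, C_unlabeled, L):
    """Unrestricted invariant-ratio-style point estimate (a simplification).

    C_* hold black-box representations f(X) in R^{L-1}; the hatted
    quantities follow the construction in Sec. 2.2.
    """
    f_hat = C_unlabeled.mean(axis=0)
    F_hat = np.stack([C_labeled[y_labeled == y].mean(axis=0)
                      for y in range(L)], axis=1)
    # Solve f_hat = F_hat @ pi' together with sum(pi') = 1 (least squares).
    A = np.vstack([F_hat, np.ones(L)])
    b = np.append(f_hat, 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi
```

Note that nothing in the least-squares solve constrains the entries of `pi` to be non-negative, which is the source of the negative estimates observed for UIR and BBS above.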
Figure 3: Mean absolute error for quantification using simulated categorical black-box classifiers under different scenarios (panels a–f: prevalence π′_1, unlabeled sample size N′, classifier quality q, number of classifier outputs K, number of labels L = K, and misspecified quality q′); the proposed Bayesian approach (BAY) is shown in blue.

4.2 Simulations employing the discrete model

Although we advocate for quantifying uncertainty around the prediction of π′, quantification is typically posed as a point estimation problem. We therefore compare the point predictions of the three matrix-inversion methods mentioned before (RIR, UIR, and BBS) with the posterior mean in the Bayesian model (BAY) and a simple baseline known as classify and count (CC; Tasche (2017) proves that this method may not converge to the ground-truth value even in the high-data limit).

Experimental design
In this paragraph we describe the experimental design with the default parameter values. They are changed one at a time, as described in further sections, and for each setting we simulate labeled and unlabeled data sets S = 50 times and, for each method, record the mean absolute error (MAE) between the ground-truth value π′* and the point estimate π̂′. Using root mean squared error (RMSE) does not qualitatively change the results (see Appendix E.2). We fix the data set sizes N = 10³ and N′ = 500 and use L = K = 5 as the default setting. The ground-truth prevalence vectors are parametrized as π* = (1/L, ..., 1/L) and

π′*(r) = (r, (1 − r)/(L − 1), ..., (1 − r)/(L − 1)).

By default, we use r = 0.7. The ground-truth matrix P(C | Y) is parameterized as φ*_yy = q and φ*_yk = (1 − q)/(K − 1) for k ≠ y (when K ≥ L), with the default value q = 0.85. Whenever K < L, we use φ*_yk = 1/K for y ∈ {K + 1, K + 2, ..., L} to obtain a valid probability vector.
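The ground-truth parameter construction just described can be sketched as follows (the function name is ours; default values match the text):

```python
import numpy as np

def ground_truth_params(L=5, K=5, r=0.7, q=0.85):
    """Construct pi*, pi'*(r), and phi* as parameterized above (a sketch)."""
    pi = np.full(L, 1.0 / L)
    pi_test = np.full(L, (1.0 - r) / (L - 1))
    pi_test[0] = r
    phi = np.full((L, K), (1.0 - q) / (K - 1))
    for y in range(min(L, K)):
        phi[y, y] = q          # diagonal entries get the classifier quality q
    if K < L:
        phi[K:, :] = 1.0 / K   # rows without a matching output are uniform
    return pi, pi_test, phi
```

Every row of the returned `phi` sums to one, so each φ*_{y:} is a valid categorical distribution for any combination of K and L.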
Changing prevalence We investigate the impact of increasing the prior probability shift (the difference between π and π′) by changing r = π′_1 ∈ {0.5, 0.6, 0.7, 0.8, 0.9} and summarize the results in Fig. 3a. CC is adversely impacted by a strong data shift. The other estimators all perform similarly to each other.

Changing data set size We investigate whether the algorithms converge to the ground-truth value in the large-data limit. We vary N′ ∈ {10, 50, 100, 500, 10^3, 10^4}. As shown in Fig. 3b, the large-data limits appear very similar (except for CC), agreeing with the asymptotic identifiability guarantees for BBS, RIR and our MAP estimates, although BBS appears slightly less accurate than the others in the low-data regime.

Changing classifier quality We investigate the impact of classifier quality by changing it in the range q ∈ {0.55, 0.65, 0.75, 0.85, 0.95} and show the results in Fig. 3c. All considered methods converge to zero error for high quality, but the convergence of CC is much slower than for the other algorithms.

Changing the classification granularity We change K ∈ {2, 3, 5, 7, 9}, creating a setting in which a given black-box classifier, trained on a different data distribution, is still informative about some of the classes, but provides different information. In particular, the CC estimator cannot be used for K ≠ L. Although the original formulation of BBS and IR assumes K = L, we proceed with the least-squares solution.

Figure 4: First two plots: conditional Student distributions P(X | Y) and the training and test populations. Third plot: posterior samples under four different Bayesian models. Fourth plot: coverage of credible intervals.

Our choice
of φ given above guarantees that the classifier for K > L = 5 will contain at least as much information as a classifier with a smaller number of classes. Conversely, for K < L, the information about some of the classes will be insufficient even in the large-data regime: it is not possible for the matrix P(C | Y) to have rank L, and asymptotic consistency does not generally hold. The results are shown in Fig. 3d. While all methods considered (apart from CC) suffer little error for K ≥ L, we note that our model-based approach can still learn something about the classes for which the classifier is informative enough, while the techniques based on matrix inversion are less effective. Additionally, we should stress that the Bayesian approach gives the whole posterior distribution on π′ (which can still appropriately shrink along the well-recoverable dimensions).

Changing the number of classes We jointly change L = K ∈ {2, 3, 5, 7, 9}. We plot the results in Fig. 3e. Again, classify and count obtains markedly worse results, with smaller differences between the other methods.

Model misspecification Finally, we study whether the considered approaches are robust to breaking the weak prior probability shift assumption: the unlabeled data are sampled according to a different P(C | Y) distribution, parameterized by q′. The weak prior probability shift assumption corresponds to the setting q′ = q, which is marked with a dashed black line in Fig. 3f. Although in this case the asymptotic identifiability guarantees do not hold, we believe this to be an important case which may occur in practice (when additional distributional shifts are present). We see that the performance of the BBS, IR and BAY estimates deteriorates for large discrepancies between q and q′. However, for |q − q′| ≤ 0.05, the median error of BBS, IR and BAY is still arguably tame, so we hope that these methods can be employed even if the weak prior probability shift assumption is only approximately correct.
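The least-squares fallback mentioned for K ≠ L, together with a crude restriction back to the simplex, might look as follows. This is a sketch under our reading of the setup; the actual restricted estimators use a proper projection onto the simplex rather than clip-and-renormalize.

```python
import numpy as np

def unrestricted_estimate(phi_hat, p_c_hat):
    """Least-squares solution of (phi^T) pi' = p(C); for K != L the system
    is solved in the least-squares sense and may leave the simplex."""
    pi, *_ = np.linalg.lstsq(phi_hat.T, p_c_hat, rcond=None)
    return pi

def crude_restriction(pi):
    """Clip negative entries and renormalize -- a crude stand-in for the
    simplex projection used by restricted estimators."""
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()
```

When P(C | Y) has full column rank L and the observed output frequencies are exact, the least-squares solution recovers the true prevalence even for K > L.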
Note that in the case when q′ > q (i.e., the classifier has better predictive accuracy on the unlabeled data set than on the labeled data set, which we think rarely occurs in practice), CC outperforms the other methods.

4.3 Uncertainty assessment in a misspecified model of a mixture of Student distributions

As mentioned in Sec. 3.1, using a black-box function f : X → C to reduce the dimensionality to a set {1, 2, ..., K} not only improves the tractability of the problem, but also has the potential to make the model more robust to misspecification, by replacing the prior probability shift assumption (invariance of P(X | Y)) with its weak version (invariance of P(C | Y)) and by learning the parameters of a discrete categorical distribution, rather than the parameters of a potentially misspecified distribution K_y(X; θ). On the other hand, M_approx loses information from the problem, so we do not expect it to be as appropriate as a properly specified generative model for P(X | Y). To investigate the properties of the M_approx approach we generated low-dimensional data, so that Bayesian inference in M_true is still tractable: we consider a mixture of two Student t-distributions presented in Fig. 4, from which we sample N = N′ = 10^3 points. We implemented a Gaussian mixture model and a Student mixture distribution in NumPyro (Phan et al., 2019), setting weakly-informative priors on their parameters (see Appendix E.3). For M_approx we partitioned the real axis into K bins: (−∞, −4), [−4, a_1), [a_1, a_2), ..., [a_{K−3}, 4), [4, ∞), with all intervals (apart from the first and the last one) of equal length.

Figure 5: First two panels: principal components of the feature vectors X colored by the biopsy sample and cell types. Third panel: inferred cell type proportions.
Fourth panel: inferred cell type proportions with vascular cells and neurons merged into one type.

In Fig. 4 we see that using 10 bins yields a posterior very similar to that of the well-specified model, and that using 5 bins yields a wider posterior, which agrees with the perspective that discretization results in information loss. However, the misspecified Gaussian mixture model concentrates around a wrong value. We repeated the simulation S = 200 times, excluded the runs with convergence issues (see Appendix E.3), and checked the frequentist coverage of the highest-density credible intervals. The coverage of the discretized models, as well as of the properly specified Student mixture, agrees well with the expected value. However, the credible intervals from the misspecified Gaussian mixture are systematically too narrow. As we demonstrate in Appendix E.3, misspecification is less problematic for lower sample sizes and more problematic for larger sample sizes. We conclude that, in the considered setting, conditioning on an insufficient statistic, as proposed by Lewis et al. (2021) for regression problems, is a viable solution also for quantification. However, it is not as data-efficient as a properly specified generative model, if one is available. We did not compare the obtained posteriors in high-dimensional problems due to the high computational costs associated with running MCMC on high-dimensional generative models.

4.4 Prevalence estimation in real-world data

As a more practical application of the proposed quantification method, we consider single-cell RNA sequencing data. Darmanis et al. (2017) collected biopsy specimens from four glioblastoma multiforme tumors, corresponding to four different populations of cells. Each cell belongs to one of six healthy types (astrocyte, neuron, oligodendrocyte, OPC, myeloid, or vascular) or is a malignant cancer cell, yielding in total L = 7 distinct cell types.
Each cell is sequenced to yield a gene expression vector X with 23,368 entries. Based on the gene expression vectors and known marker genes, the cells have been assigned to the considered cell types; we treat the provided annotations as ground-truth labels. In Fig. 5 we plot the first two principal components of the whole data set to visualise the distribution of features within each sample and in relation to the cell type. We note that although Darmanis et al. (2017) used TPM normalization (Zhao et al., 2021) to normalize the features X, and the cells seem to roughly cluster within cell types, the sample-specific effects are still visible (see also Appendix E.4), so that the prior probability shift assumption of invariant P(X | Y) is violated. We consider a semi-realistic scenario in which one wants to estimate the cell prevalence in an automated fashion employing a given black-box cell type classifier. We treat the first two samples as an auxiliary cell atlas on which a generic black-box cell type classifier was trained (we use a random forest), the third sample as the available labeled data set, {(X_i, Y_i)}, and the fourth sample as the unlabeled data set, {X′_j}, for the quantification problem. Although the prior probability shift assumption is violated, we can hope that the labels predicted by the random forest are more invariant and that methods working under the weak prior probability shift assumption may still yield reasonable estimates. We consider quantification methods using predicted labels (the posterior mean in the proposed model, BBS, RIR, and CC) as well as two methods accepting probabilities rather than labels: Expectation-Maximization (EM) and a soft variant of the restricted invariant ratio estimator (RIR (soft)). We do not use the recalibration techniques proposed by Alexandari et al. (2020), instead using the vanilla probabilities provided by the random forest.
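For reference, the EM baseline can be sketched along the lines of Saerens et al. (2001): alternate between reweighting the classifier's posteriors by the current prevalence ratio and averaging the adjusted posteriors. This is a minimal illustrative implementation, not the code used in the experiments.

```python
import numpy as np

def em_prevalence(probs, pi_train, n_iter=200):
    """EM re-estimation of test prevalence from classifier probabilities
    (in the spirit of Saerens et al., 2001).

    probs:    (N', L) array of training-time posteriors P_train(Y | X'_j).
    pi_train: (L,) class prevalence on the training data.
    """
    pi = pi_train.astype(float).copy()
    for _ in range(n_iter):
        # E-step: reweight posteriors by the prevalence ratio and renormalize.
        w = probs * (pi / pi_train)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: the new prevalence estimate is the average adjusted posterior.
        pi = w.mean(axis=0)
    return pi
```

Note that the output depends on the classifier being reasonably calibrated on the test distribution, which is precisely what is doubtful here when the prior probability shift assumption is violated.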
Generally, the posterior mean captures the true cell type prevalence well, showing large uncertainty around the vascular cells and neurons. Similar performance is obtained by the simple CC baseline, owing to the good performance of the classifier. Interestingly, the methods using estimated probabilities (RIR (soft) and EM) obtained the worst performance. We hypothesise that this is due to the violated prior probability shift assumption and that using discrete labels, rather than continuous probabilities, improves the robustness. As the methods employing matrix inversion and a black-box classifier (BBS and RIR) failed due to non-invertibility of the estimated matrices, we then decided to merge the two least prevalent cell types occurring in the data set (vascular cells and neurons) into a single "Other" class. In that case, RIR and BBS obtain performance on par with the posterior mean and CC.

5 Discussion

The presented approach generalizes the point estimates provided by black-box shift estimators and invariant ratio estimators to the Bayesian inference setting. This allows one to quantify uncertainty and to use existing knowledge about the problem via prior specification. Moreover, by the construction of the sufficient statistic, our approach is tractable even in the large-data limit (for either data set considered). In all our experiments, the suggested estimator obtained performance at least as good as the existing methods, outperforming them in the K < L case, where the number of modeled classes differs from the true number of classes. Compared to point estimates with asymptotic guarantees, our approach "knows what it does not know": the posterior is meaningful even if the matrix P(C | Y) is not (left-)invertible, and it concentrates around the prevalence values of those classes for which the feature extractor f is sufficiently informative.
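The sufficient-statistic construction mentioned here (derived in Appendix B) reduces both data sets to small count tables, so evaluating the likelihood costs O(KL) regardless of N and N′. A minimal sketch, assuming 0-based integer labels; the function names are ours.

```python
import numpy as np

def sufficient_stats(y, c, c_prime, L, K):
    """Counts summarizing the data: n_l (labels), f_lk (label/output pairs),
    n'_k (outputs on the unlabeled set)."""
    n = np.bincount(y, minlength=L)
    f = np.zeros((L, K), dtype=int)
    np.add.at(f, (y, c), 1)  # f[l, k] = #{i : y_i = l, c_i = k}
    n_prime = np.bincount(c_prime, minlength=K)
    return n, f, n_prime

def log_likelihood(pi, pi_prime, phi, n, f, n_prime):
    """l(pi, pi', phi) = sum_l n_l log pi_l + sum_{l,k} f_lk log phi_lk
    + sum_k n'_k log (phi^T pi')_k, up to an additive constant."""
    return (
        np.dot(n, np.log(pi))
        + np.sum(f * np.log(phi))
        + np.dot(n_prime, np.log(phi.T @ pi_prime))
    )
```

Only the counts enter the likelihood, which is what makes MCMC over (π, π′, φ) cheap even for large samples.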
Moreover, the proposed approach aligns with the approaches employing black-box feature extractors f, which can be trained on a different data set and may not be calibrated properly. This is particularly useful when a hard, fully black-box classifier is given without the possibility of retraining it, an increasingly common situation in modern AI applications, where models are often large proprietary systems performing sophisticated processing, available only through APIs. However, the method we introduce is not free from challenges. As in all Bayesian inference, care is required regarding the modeling assumptions: whether the discrete model is applicable and what prior should be used. In particular, even the weak prior probability shift assumption may not hold (e.g., if the labeled and unlabeled data sets were collected under radically different conditions, or if they have different label sets Y). Additionally, Bayesian inference often carries a model choice problem, and different choices of K or the discretization method f may yield different posteriors on the prevalence vector π′, especially in the low-data regime. If the generative model P(X | Y, θ) is well-specified and tractable, we suggest using it instead of the approximation P(C | Y, φ). If it is not tractable, we suggest using the available classifier with K classes, monitoring the quality of the φ = P(C | Y) matrix, and perhaps training one's own classifier on some hold-out data set. The experiments performed in Section 4.2 and Theorem 3 of Lipton et al. (2018) suggest that improving the classification accuracy results in more accurate prevalence quantification. Finally, we rely on Markov chain Monte Carlo sampling schemes, which explore an O(KL)-dimensional parameter space. Although in this manuscript we focus on problems with a moderate number of classes, there exist complex data sets, such as ImageNet (Deng et al., 2009), with thousands of categories.
We note that this can pose additional computational challenges for the MCMC samplers (Betancourt, 2015; Bardenet et al., 2017; Izmailov et al., 2021), which we do not investigate in this work.

Statement of broader impact

This article discusses a Bayesian method of quantifying the prevalence of different classes in an unlabeled data set. We note that, in general, the parameter posterior conditioned on the full data view X can be different from the posterior conditioned on some representation C = f(X); in cases where a reliable model P(X | Y) is available and the inference is tractable, we suggest using it instead of our discretized method. Secondly, the model need not apply: perhaps prior probability shift is not the only distribution shift occurring in the problem, or the data may not be exchangeable. In epidemiology, for example, outbreaks induce correlations between the healthiness of different people that can easily extend to sampling. Finally, even if all the assumptions hold, recalibrating a probabilistic classifier with quantification may have undesirable consequences regarding fairness (Plecko & Bareinboim, 2022).

Code availability and reproducibility

We ensured reproducibility by designing all experiments as Snakemake workflows (Mölder et al., 2021). The code and workflows used to run the experiments and generate the figures are available in the https://github.com/pawel-czyz/labelshift repository.

Acknowledgments

We would like to thank Ian Wright for valuable comments on the manuscript. This publication was supported by GitHub, Inc. and the ETH AI Center. We would like to thank both institutions.

References

Amr M. Alexandari, Anshul Kundaje, and Avanti Shrikumar. Maximum likelihood with bias-corrected calibration is hard-to-beat at label shift adaptation. In Proceedings of the 37th International Conference on Machine Learning, ICML'20.
JMLR.org, 2020.

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv e-prints, art. arXiv:1907.02893, Jul 2019.

Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJl0r3R9KX.

Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18(47):1-43, 2017. URL http://jmlr.org/papers/v18/15-205.html.

Antonio Bella, Cesar Ferri, José Hernández-Orallo, and María José Ramírez-Quintana. Quantification via probability estimators. In 2010 IEEE International Conference on Data Mining, pp. 737-742, 2010. doi: 10.1109/ICDM.2010.75.

J.M. Bernardo and A.F.M. Smith. Bayesian Theory. Wiley Series in Probability and Statistics. Wiley, 1994. ISBN 9780471494645.

Michael Betancourt. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 533-540, Lille, France, 07-09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/betancourt15.html.

Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo, 2017. URL https://arxiv.org/abs/1701.02434.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, mar 2003. ISSN 1532-4435.

Mathieu Blondel, Akinori Fujino, and Naonori Ueda. Large-scale multiclass support vector machine training via Euclidean projection onto the simplex. In 2014 22nd International Conference on Pattern Recognition, pp. 1289-1294, 2014. doi: 10.1109/ICPR.2014.231.

Daniel C. Castro, Ian Walker, and Ben Glocker.
Causality matters in medical imaging. Nature Communications, 11:3673ff., 7 2020.

Spyros Darmanis, Steven A. Sloan, Derek Croote, Marco Mignardi, Sophia Chernikova, Peyman Samghababi, Ye Zhang, Norma Neff, Mark Kowarsky, Christine Caneda, Gordon Li, Steven D. Chang, Ian David Connolly, Yingmei Li, Ben A. Barres, Melanie Hayden Gephart, and Stephen R. Quake. Single-cell RNA-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. Cell Reports, 21(5):1399-1410, 2017. ISSN 2211-1247. doi: 10.1016/j.celrep.2017.10.030. URL https://www.sciencedirect.com/science/article/pii/S2211124717314626.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1-26, 1979. doi: 10.1214/aos/1176344552. URL https://doi.org/10.1214/aos/1176344552.

Valerii Fedorov, Frank Mannino, and Rongmei Zhang. Consequences of dichotomization. Pharmaceutical Statistics, 8(1):50-61, 2009. doi: 10.1002/pst.331. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/pst.331.

George Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17:164-206, October 2008.

Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary Lipton. A unified view of label shift estimation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 3290-3300. Curran Associates, Inc., 2020.

A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013. ISBN 9781439840955.
URL https://books.google.pl/books?id=ZXL6AQAAQBAJ.

Andrew Gelman, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. Bayesian workflow, 2020.

Pablo González, Alberto Castaño, Nitesh Chawla, and Juan del Coz. A review on quantification learning. ACM Computing Surveys, 50:1-40, 09 2017. doi: 10.1145/3117807.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 1321-1330. JMLR.org, 2017.

Paul R. Halmos and L. J. Savage. Application of the Radon-Nikodym theorem to the theory of sufficient statistics. The Annals of Mathematical Statistics, 20(2):225-241, 1949. doi: 10.1214/aoms/1177730032. URL https://doi.org/10.1214/aoms/1177730032.

F.E. Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer Series in Statistics. Springer International Publishing, 2015. ISBN 9783319194257. URL https://books.google.pl/books?id=94RgCgAAQBAJ.

Matthew D. Hoffman and Andrew Gelman. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(47):1593-1623, 2014. URL http://jmlr.org/papers/v15/hoffman14a.html.

Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Gordon Wilson. What are Bayesian neural network posteriors really like? In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 4629-4640. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/izmailov21a.html.

Nikolay Karpov, Alexander Porshnev, and Kirill Rudakov. NRU-HSE at SemEval-2016 task 4: Comparative analysis of two iterative methods using quantification library.
In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 171-177, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/S16-1025. URL https://www.aclweb.org/anthology/S16-1025.

Achim Klenke. Probability Theory: A Comprehensive Course. Springer, 2014.

John K. Kruschke. Bayesian analysis reporting guidelines. Nature Human Behaviour, 5(10):1282-1291, Oct 2021. ISSN 2397-3374. doi: 10.1038/s41562-021-01177-7. URL https://doi.org/10.1038/s41562-021-01177-7.

John R. Lewis, Steven N. MacEachern, and Yoonkyung Lee. Bayesian restricted likelihood methods: Conditioning on insufficient statistics in Bayesian regression (with discussion). Bayesian Analysis, 16(4):1393-2854, 2021. doi: 10.1214/21-BA1257. URL https://doi.org/10.1214/21-BA1257.

Zachary C. Lipton, Yu-Xiang Wang, and Alexander J. Smola. Detecting and correcting for label shift with black box predictors. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 3128-3136. PMLR, 2018.

Simon Lyddon, Stephen Walker, and Chris C Holmes. Nonparametric learning from Bayesian models with randomized objective functions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Alejandro Moreo, Andrea Esuli, and Fabrizio Sebastiani. QuaPy: a Python-based framework for quantification. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4534-4543, 2021.

F Mölder, KP Jablonski, B Letcher, MB Hall, CH Tomkins-Tinch, V Sochat, J Forster, S Lee, SO Twardziok, A Kanitz, A Wilm, M Holtgrewe, S Rahmann, S Nahnsen, and J Köster.
Sustainable data analysis with Snakemake. F1000Research, 10(33), 2021. doi: 10.12688/f1000research.29032.1.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

C. Peters and W.A. Coberly. The numerical evaluation of the maximum-likelihood estimate of mixture proportions. Communications in Statistics - Theory and Methods, 5:1127-1135, 1976.

Du Phan, Neeraj Pradhan, and Martin Jankowiak. Composable effects for flexible and accelerated probabilistic programming in NumPyro. arXiv preprint arXiv:1912.11554, 2019.

Drago Plecko and Elias Bareinboim. Causal fairness analysis, 2022.

Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945-959, 06 2000. ISSN 1943-2631. doi: 10.1093/genetics/155.2.945. URL https://doi.org/10.1093/genetics/155.2.945.

Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14:14-21, 2001.

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML'12, pp. 459-466, Madison, WI, USA, 2012. Omnipress. ISBN 9781450312851.

Shai Shalev-Shwartz and Yoram Singer. Efficient learning of label ranking by soft projections onto polyhedra. J. Mach. Learn. Res., 7:1567-1599, dec 2006. ISSN 1532-4435.

Amos Storkey. When training and test sets are different: Characterizing learning transfer. In Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence (eds.), Dataset Shift in Machine Learning, chapter 1, pp.
3-28. The MIT Press, 2009. ISBN 0262170051.

Dirk Tasche. Fisher consistency for prior probability shift. Journal of Machine Learning Research, 18(95):1-32, 2017. URL http://jmlr.org/papers/v18/17-048.html.

Dirk Tasche. Confidence intervals for class prevalences under prior probability shift. Machine Learning and Knowledge Extraction, 1(3):805-831, 2019. ISSN 2504-4990. doi: 10.3390/make1030047. URL https://www.mdpi.com/2504-4990/1/3/47.

Afonso Fernandes Vaz, Rafael Izbicki, and Rafael Bassi Stern. Quantification under prior probability shift: the ratio estimator and its extensions. Journal of Machine Learning Research, 20(79):1-33, 2019. URL http://jmlr.org/papers/v20/18-456.html.

James Watson and Chris Holmes. Approximate models and robust decisions. Statistical Science, 31(4):465-489, 2016. doi: 10.1214/16-STS592. URL https://doi.org/10.1214/16-STS592.

Jack Xue and Gary Weiss. Quantification and semi-supervised classification methods for handling changes in class distribution. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 897-906, 01 2009. doi: 10.1145/1557019.1557117.

Andy B. Yoo, Morris A. Jette, and Mark Grondona. SLURM: Simple Linux utility for resource management. In Dror Feitelson, Larry Rudolph, and Uwe Schwiegelshohn (eds.), Job Scheduling Strategies for Parallel Processing, pp. 44-60, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-39727-4.

Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML'13, pp. III-819-III-827. JMLR.org, 2013.

Yingdong Zhao, Ming-Chung Li, Mariam M. Konaté, Li Chen, Biswajit Das, Chris Karlovich, P. Mickey Williams, Yvonne A. Evrard, James H. Doroshow, and Lisa M. McShane. TPM, FPKM, or normalized counts?
A comparative study of quantification measures for the analysis of RNA-seq data from the NCI patient-derived models repository. Journal of Translational Medicine, 19(1):269, Jun 2021. ISSN 1479-5876. doi: 10.1186/s12967-021-02936-w. URL https://doi.org/10.1186/s12967-021-02936-w.

Albert Ziegler and Paweł Czyż. Unsupervised recalibration, 2020. URL https://arxiv.org/abs/1908.09157.

A Strict linear independence of measures

We use the following definition:

Definition A.1. Let K_1, ..., K_L be probability measures on a measurable space X. We say that they are strictly linearly independent if for every non-zero λ ∈ R^L there exists a measurable set A_λ such that λ_1 K_1(A_λ) + ⋯ + λ_L K_L(A_λ) ≠ 0.

This definition is especially useful for proving identifiability of a well-specified mixture model:

Theorem A.2. Let K_1, ..., K_L be strictly linearly independent probability measures on a space X. Then, for every two vectors of mixture weights π, π′ ∈ ∆^{L−1} such that π_1 K_1 + ⋯ + π_L K_L = π′_1 K_1 + ⋯ + π′_L K_L, it follows that π = π′.

Proof. Consider the difference λ = π − π′ ∈ R^L. If it were not the zero vector, then for some set A_λ we would have π_1 K_1(A_λ) + ⋯ + π_L K_L(A_λ) ≠ π′_1 K_1(A_λ) + ⋯ + π′_L K_L(A_λ). Note also that A_λ has positive measure with respect to at least one of the mixture measures, meaning that for a well-specified model the discrepancy will eventually be visible according to the law of large numbers.

Although the definition above looks different from the original one proposed by Garg et al. (2020), they are essentially equivalent:

Lemma A.3. Let µ be a σ-finite measure on X such that the Radon-Nikodym derivatives k_y = dK_y/dµ exist. Then, the probability measures K_1, ..., K_L are strictly linearly independent if and only if ∫_X |λ_1 k_1 + ⋯ + λ_L k_L| dµ > 0 for every non-zero λ ∈ R^L.

Proof. Consider any λ ≠ 0 and write ν = λ_1 K_1 + ⋯ + λ_L K_L for the corresponding signed measure. Using the standard rules of Radon-Nikodym calculus, the condition on the integral is equivalent to |ν|(X) ≠ 0.
If there exists A_λ such that ν(A_λ) ≠ 0, then |ν|(X) ≥ |ν|(A_λ) ≥ |ν(A_λ)| > 0. Conversely, assume that |ν|(X) ≠ 0 and take the Hahn decomposition of X, with X = P ∪ N such that P ∩ N = ∅, ν(P) ≥ 0, and ν(N) ≤ 0. We now define the Hahn-Jordan decomposition of ν into two positive measures, ν⁺(A) = ν(A ∩ P) and ν⁻(A) = −ν(A ∩ N), with the properties ν = ν⁺ − ν⁻ and |ν| = ν⁺ + ν⁻. We conclude that |ν|(X) = ν(P) − ν(N) ≠ 0, so that at least one of the sets P and N can be taken as A_λ.

The above characterization of strict linear independence gives the following result:

Lemma A.4. Assume that X is a standard Borel space and µ is a strictly positive measure. Further, assume that the Radon-Nikodym derivatives k_1, ..., k_L are continuous functions, treated as vectors in the space of all continuous functions C(X, R). Then, if k_1, ..., k_L are linearly independent, then K_1, ..., K_L are strictly linearly independent.

Proof. Take any λ ≠ 0 and write u = |λ_1 k_1 + ⋯ + λ_L k_L| ∈ C(X, R). From the linear independence it follows that there exists x_0 ∈ X such that u(x_0) > 0.

We can use the continuity of u to find an open neighborhood A of x_0 such that for all x ∈ A we have u(x) > u(x_0)/2. As u is non-negative and µ is strictly positive, we have µ(A) > 0 and

∫_X u(x) dµ(x) ≥ ∫_A u(x) dµ(x) ≥ (u(x_0)/2) µ(A) > 0.

B Derivation of the sufficient statistic

Starting from the joint probability

P(π, π′, φ, {Y_i, C_i}, {Y′_j, C′_j}) = P(π, π′, φ) ∏_{i=1}^{N} P(C_i | φ, Y_i) P(Y_i | π) ∏_{j=1}^{N′} P(C′_j | φ, Y′_j) P(Y′_j | π′),

we need to derive

P(π, π′, φ | {Y_i, C_i}, {C′_j}) ∝ P({Y_i, C_i}, {C′_j} | π, π′, φ) P(π, π′, φ).

The observed likelihood is given by marginalization over the Y′_j variables:

P({Y_i, C_i}, {C′_j} | π, π′, φ) = ∑_{l_1, ..., l_{N′}} ∏_{i=1}^{N} P(C_i | φ, Y_i) P(Y_i | π) ∏_{j=1}^{N′} P(C′_j | φ, Y′_j = l_j) P(Y′_j = l_j | π′)
= ∏_{i=1}^{N} P(C_i | φ, Y_i) P(Y_i | π) · ∏_{j=1}^{N′} ∑_{l_j ∈ Y} P(C′_j | φ, Y′_j = l_j) P(Y′_j = l_j | π′).

Each of these terms will be calculated separately.
We want to calculate

A = ∏_{i=1}^{N} P(C_i = c_i | φ, Y_i = y_i) P(Y_i = y_i | π) = [∏_{i=1}^{N} P(C_i = c_i | φ, Y_i = y_i)] · [∏_{i=1}^{N} P(Y_i = y_i | π)] = A_1 · A_2.

The term A_2 is simple to calculate: as P(Y_i = y_i | π) = π_{y_i}, we have

A_2 = ∏_{l=1}^{L} (π_l)^{n_l},

where n_l is the number of i ∈ {1, ..., N} such that y_i = l. In particular, up to a factor N!/(n_1! ⋯ n_L!), this is the PMF of the multinomial distribution parametrised by π evaluated at (n_1, ..., n_L). To calculate A_1 we need to observe that P(C_i = k | φ, Y_i = l) = φ_{lk}. Hence,

A_1 = ∏_{i=1}^{N} P(C_i = c_i | φ, Y_i = y_i) = ∏_{l=1}^{L} ∏_{k=1}^{K} (φ_{lk})^{f_{lk}},

where f_{lk} is the number of i ∈ {1, ..., N} such that y_i = l and c_i = k. Observe that n_l = f_{l1} + ⋯ + f_{lK}. In particular, up to the factor

∏_{l=1}^{L} n_l!/(f_{l1}! ⋯ f_{lK}!),

this corresponds to the product of the PMFs of L multinomial distributions parametrised by the probabilities φ_{l:} evaluated at f_{l:}. Recall that the remaining factor is

B = ∏_{j=1}^{N′} ∑_{l_j ∈ Y} P(C′_j = c′_j | φ, Y′_j = l_j) P(Y′_j = l_j | π′),

which we obtained via the sum-product identity ∑_{l_1} ⋯ ∑_{l_{N′}} ∏_{j=1}^{N′} f_j(l_j) = ∏_{j=1}^{N′} ∑_{l} f_j(l). Because both C′_j and Y′_j are parametrised with categorical distributions, we have P(C′_j = k | φ, Y′_j = l) = φ_{lk} and P(Y′_j = l | π′) = π′_l, so that

∑_{l ∈ Y} P(C′_j = k | φ, Y′_j = l) P(Y′_j = l | π′) = (φ^T π′)_k

and

B = ∏_{j=1}^{N′} (φ^T π′)_{c′_j} = ∏_{k=1}^{K} ((φ^T π′)_k)^{n′_k},

where n′_k is the number of j ∈ {1, ..., N′} such that c′_j = k. In particular, up to a factor of N′!/(n′_1! ⋯ n′_K!), this is the PMF of the multinomial distribution parametrized by the probabilities φ^T π′ evaluated at (n′_1, ..., n′_K).

C Proof of asymptotic identifiability

In this section we prove Theorem 3.1. We first need to establish two simple lemmas regarding approximate left inverses:

Lemma C.1. Choose any norms on the space of linear maps R^L → R^K and R^K → R^L. Suppose K ≥ L and that A_0 : R^L → R^K is of full rank L. Then, for every ε > 0 there exists δ > 0 such that if A : R^L → R^K is any matrix with ||A − A_0|| < δ, then the left inverse A^{−1} := (A^T A)^{−1} A^T exists and ||A^{−1} − A_0^{−1}|| < ε.

Proof.
First note that the choice of norms indeed does not matter, as all norms on finite-dimensional vector spaces are equivalent. Then, observe that the rank is a lower semi-continuous function, so that for sufficiently small δ the map A will be of rank L as well. Finally, it is clear that the chosen formula for the left inverse is continuous as a function of A.

Lemma C.2. If K ≥ L and the matrix A₀: ℝ^L → ℝ^K is of full rank L, then for every ε > 0 there exist numbers δ > 0 and ν > 0 such that for every linear mapping A: ℝ^L → ℝ^K and vector v ∈ ℝ^L, if ‖A − A₀‖ < δ and ‖Av − A₀v₀‖ < ν, then ‖v − v₀‖ < ε.

Proof. Again, the norm on either space can be chosen arbitrarily without any loss of generality. We will choose the p-norm for vectors and the induced matrix norms. From the previous lemma we know that for any chosen β > 0 we can take δ > 0 such that A is left-invertible and ‖B − B₀‖ < β, where B = A⁻¹ and B₀ = A₀⁻¹ are the left inverses in the form defined before. Write w = Av and w₀ = A₀v₀. We have

‖v − v₀‖ = ‖Bw − B₀w₀‖ = ‖(Bw − B₀w) + (B₀w − B₀w₀)‖ = ‖(B − B₀)w + B₀(w − w₀)‖ ≤ ‖(B − B₀)w‖ + ‖B₀(w − w₀)‖ ≤ ‖B − B₀‖·‖w‖ + ‖B₀‖·‖w − w₀‖ ≤ β‖w‖ + ‖B₀‖ν.

We can bound each of these two terms by ε/3 by choosing appropriate β and ν. Then, we can find δ yielding the appropriate β.

Now the proof of Theorem 3.1 will proceed in two steps:

1. We show that for any prescribed probability we can find N and N′ large enough that the maximum likelihood solution will be close to the true parameter values.
2. Then, we show that for reasonable priors the maximum a posteriori solution will almost surely asymptotically converge to the maximum likelihood solution.

Let us assume that the data was sampled from the model with true parameters π*, π′*, φ* and take δ > 0 and ε > 0.
For any ν > 0 we can use the fact that the log-likelihood is given by

ℓ(π, π′, φ) = Σ_{l ∈ Y} N_l log π_l + Σ_{l ∈ Y} Σ_{k ∈ C} F_{lk} log φ_{lk} + Σ_{k ∈ C} N′_k log (φᵀπ′)_k,

and by the strong law of large numbers we can find N and N′ large enough that with probability at least 1 − δ we will have

‖π̂ − π*‖ < ν, ‖φ̂ − φ*‖ < ν, and ‖φ̂ᵀπ̂′ − (φ*)ᵀπ′*‖ < ν,

where (π̂, π̂′, φ̂) is the maximum likelihood estimate. Basing on the previously established lemmas, we conclude that we can pick ν small enough that ‖π̂ − π*‖ < ε, ‖φ̂ − φ*‖ < ε, and ‖π̂′ − π′*‖ < ε.

Now note that if we assume the PDF of the prior P(π, π′, φ) to be continuous, we can take a compact neighborhood of (π*, π′*, φ*) inside ∆^{L−1} × ∆^{L−1} × (∆^{K−1})^L with probability mass arbitrarily close to 1. Then, the log-prior defined on this set will be bounded, and the maximum a posteriori estimate can be made arbitrarily close to the maximum likelihood estimate with any desired probability.

C.1 Should one rely on the maximum a posteriori estimate?

Although in the above section we discuss why the maximum a posteriori (MAP) estimate is consistent in the large-sample limit, we advise against using it: as the data is finite, the fully Bayesian approach is to use the full posterior (approximated by the Markov chain Monte Carlo samples) to understand the uncertainty associated with the provided estimates. In settings in which the posterior distribution has to be summarized in terms of a single point estimate (e.g., for the comparison with estimators providing point estimates in Section 4.2), we advise using the posterior mean, which is available directly from the MCMC samples. Conversely, the MAP estimate requires optimization, rather than sampling, and may not be well-defined (e.g., in non-identifiable problems with a non-invertible P(C | Y) matrix or when the sample size is not large enough). Moreover, the posterior mean may be preferable over the MAP estimate on the basis of Bayesian decision theory.
We present a short version of the standard decision-theoretic argument and refer to Bernardo & Smith (1994, Sec. 5.1.5) for a more detailed treatment. An approach employed in Bayesian decision theory is to define a loss function ℓ(π̂′, π′) quantifying the risk associated with choosing the estimate π̂′ when in reality the estimand attains the value π′. Various applications involve different loss functions, to be specified by the decision maker: for example, if the component π′₁ describes the prevalence of a viral disease, one may prefer to choose a loss function assigning higher loss to predictions underestimating the disease prevalence (i.e., π̂′₁ < π′₁ should result in a higher loss than π̂′₁ > π′₁). Bayesian decision theory then argues for providing an estimate minimizing the expected loss, ∫ ℓ(π̂′, π′) dP(π′ | data). The MAP estimate (provided that it exists) corresponds to the limit of the losses ℓ_{0-1,ϵ}(π̂′, π′) = 1[‖π̂′ − π′‖ > ϵ] for ϵ → 0⁺. The resulting discontinuous 0-1 loss may not be of direct interest in applied problems: for finite data the estimates are expected to differ from the ground-truth value, and one may instead try to quantify by how much the estimate differs. For the loss ℓ₂(π̂′, π′) = ‖π̂′ − π′‖², the minimum of the expected loss is attained at the posterior mean. At the same time, we stress that it is preferable to analyze the whole posterior, rather than to rely on a single point estimate, especially since in reality the decision maker may not have access to an oracle loss function and can consider different candidate losses. In particular, in Section 4.2 and Appendix E.2 we quantify the algorithm performance using two losses for which the minimum may not be attained at the posterior mean. Nevertheless, the experimental results presented there confirm that the (arguably suboptimal) choice of the posterior mean as the point estimate is on par with the point estimates provided by other methods.
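The optimality of the posterior mean under squared-error loss can be checked numerically on any set of posterior draws. A small self-contained sketch, in which Dirichlet draws merely stand in for MCMC samples (the distribution and its parameters are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Hypothetical posterior draws of a prevalence vector; a Dirichlet sample
# stands in for MCMC output here.
rng = np.random.default_rng(0)
samples = rng.dirichlet([8.0, 3.0, 1.0], size=4000)

def expected_l2_loss(estimate, samples):
    # Monte Carlo estimate of the expected loss E[ ||estimate - pi'||^2 ]
    # under the (empirical) posterior.
    return np.mean(np.sum((samples - estimate) ** 2, axis=1))

posterior_mean = samples.mean(axis=0)
# Any other point estimate, e.g. the renormalized componentwise median,
# incurs an expected squared loss at least as large as the posterior mean's.
other = np.median(samples, axis=0)
other /= other.sum()
assert expected_l2_loss(posterior_mean, samples) <= expected_l2_loss(other, samples)
```

The inequality holds for any competing estimate, since the empirical mean is the exact minimizer of the empirical expected squared loss.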
D Quantification algorithms

In this section we provide additional details on existing quantification methods, expanding on the description in Sec. 2.

D.1 Expectation-maximization and the Gibbs sampler

In this section we analyse the expectation-maximization algorithm introduced by Saerens et al. (2001), which assumes access to a well-calibrated probabilistic classifier providing the probabilities r(x) = P(Y | X = x, π). We assume a Dirichlet prior for P(π′), so that the expectation-maximization targets the maximum a posteriori of the distribution P(π′ | {X′_j}, r). The original algorithm of Saerens et al. (2001) then corresponds to the uniform prior, P(π′) = Dirichlet(π′ | 1, 1, ..., 1). Finally, we will show how to adjust the expectation-maximization algorithm to obtain a Gibbs sampler, sampling from the posterior P(π′ | {X′_j}, r). To improve readability we will generally drop the conditioning on r, leaving it implicit.

D.1.1 Expectation-maximization

Saerens et al. (2001) notice that if one has a well-calibrated classifier P(Y | X, π), then one also has access to the distribution P(Y | X, π′):

P(Y = y | X = x, π′) ∝ P(Y = y, X = x | π′) = P(X = x | Y = y, π′) P(Y = y | π′) = P(X = x | Y = y) π′_y,

where the proportionality constant does not depend on y. Analogously,

P(Y = y | X = x, π) ∝ P(X = x | Y = y) π_y.

As P(X = x | Y = y) is the same in both expressions, we can take their ratio and obtain

P(Y = y | X = x, π′) ∝ P(Y = y | X = x, π) π′_y / π_y,

where the proportionality constant does not depend on y. This yields unnormalized probabilities, which can be easily rescaled so that they sum up to 1.

Expectation-maximization is an iterative algorithm finding a stationary point of the log-posterior

log P(π′ | {X′_j = x′_j}) = log P(π′) + log P({X′_j = x′_j} | π′) + const = log P(π′) + Σ_{j=1}^{N′} log P(X′_j = x′_j | π′) + const.

In particular, by running the optimization procedure several times, we can aim at finding the maximum a posteriori estimate.
Assume that at the current iteration the proportion vector is π′^(t). Then,

log P(X′_j = x′_j | π′) = log Σ_{y=1}^{L} P(X′_j = x′_j, Y′_j = y | π′)
= log Σ_{y=1}^{L} P(Y′_j = y | π′^(t), X′_j = x′_j) · [ P(X′_j = x′_j, Y′_j = y | π′) / P(Y′_j = y | π′^(t), X′_j = x′_j) ]
≥ Σ_{y=1}^{L} P(Y′_j = y | π′^(t), X′_j = x′_j) log [ P(X′_j = x′_j, Y′_j = y | π′) / P(Y′_j = y | π′^(t), X′_j = x′_j) ],

where the bound follows from Jensen's inequality. Hence,

log P({X′_j = x′_j} | π′) = Σ_{j=1}^{N′} log P(X′_j = x′_j | π′) ≥ Σ_{j=1}^{N′} Σ_{y=1}^{L} P(Y′_j = y | π′^(t), X′_j = x′_j) log [ P(X′_j = x′_j, Y′_j = y | π′) / P(Y′_j = y | π′^(t), X′_j = x′_j) ].

Let

Q(π′, π′^(t)) = log P(π′) + Σ_{j=1}^{N′} Σ_{y=1}^{L} P(Y′_j = y | π′^(t), X′_j = x′_j) log [ P(X′_j = x′_j, Y′_j = y | π′) / P(Y′_j = y | π′^(t), X′_j = x′_j) ]

be a lower bound on the log-posterior. We will define the value π′^(t+1) by optimizing this lower bound, i.e., π′^(t+1) := argmax_{π′} Q(π′, π′^(t)).

Define the auxiliary variables ξ_jy = P(Y′_j = y | π′^(t), X′_j = x′_j), which can be calculated using the probabilistic classifier r as above. Hence,

Q(π′, π′^(t)) = log P(π′) + Σ_{j,y} ξ_jy log P(X′_j = x′_j, Y′_j = y | π′) − Σ_{j,y} ξ_jy log ξ_jy.

Note that the term Σ_{j,y} ξ_jy log ξ_jy does not depend on π′, so it does not have to be included in the optimization. Similarly, we can write log P(X′_j = x′_j, Y′_j = y | π′) = log P(X′_j = x′_j | Y′_j = y) + log π′_y and notice that log P(X′_j = x′_j | Y′_j = y) also does not depend on π′. Hence, we have to optimize the expression

log P(π′) + Σ_{j=1}^{N′} Σ_{y=1}^{L} ξ_jy log π′_y,

where P(π′) is modeled as the Dirichlet distribution, Dirichlet(π′ | α₁, ..., α_L), with the constraint π′₁ + ⋯ + π′_L = 1. Saerens et al. (2001) use the technique of Lagrange multipliers. However, we can instead optimise over the first L − 1 coordinates and write π′_L = 1 − (π′₁ + ⋯ + π′_{L−1}). In this case, if we differentiate with respect to π′_y, we obtain A_y/π′_y − A_L/π′_L = 0, where A_y = α_y − 1 + Σ_{j=1}^{N′} ξ_jy. Hence, π′_y = k·A_y for some constant k > 0. As Σ_{y=1}^{L} A_y = (α₁ + ⋯ + α_L) − L + N′, we obtain

π′_y = A_y / ((α₁ + ⋯ + α_L) − L + N′),

which is taken as the next iteration value, π′^(t+1).
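Putting the E-step and the M-step together, the iteration can be sketched in a few lines of NumPy (a minimal illustration under our own naming, not the authors' code; `r` holds the calibrated classifier outputs P(Y = y | X = x′_j, π)):

```python
import numpy as np

def em_prevalence(r, pi_train, alpha=None, n_iter=500):
    """EM of Saerens et al. (2001) with a Dirichlet(alpha) prior on pi'.

    r        : (N', L) array of calibrated probabilities P(Y = y | X = x'_j, pi),
    pi_train : (L,) training prevalence vector pi,
    alpha    : Dirichlet concentrations (all-ones recovers the original algorithm).
    """
    n, L = r.shape
    alpha = np.ones(L) if alpha is None else np.asarray(alpha, dtype=float)
    pi_test = np.full(L, 1.0 / L)  # initial guess pi'^(0)
    for _ in range(n_iter):
        # E-step: xi_jy is proportional to r_y(x'_j) * pi'_y / pi_y, normalized over y.
        xi = r * (pi_test / pi_train)
        xi /= xi.sum(axis=1, keepdims=True)
        # M-step: pi'_y = (alpha_y - 1 + sum_j xi_jy) / (sum_y alpha_y - L + N').
        pi_test = (alpha - 1.0 + xi.sum(axis=0)) / (alpha.sum() - L + n)
    return pi_test
```

For a perfectly confident classifier the responsibilities stay one-hot and the procedure reduces to counting predicted labels.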
The procedure is repeated until the sequence π′^(t) (approximately) converges to a point.

D.1.2 Gibbs sampler

As is typical for expectation-maximization algorithms, it is possible to implement a Gibbs sampler targeting a sample from the posterior P(π′ | {X′_j}), rather than the mode (maximum a posteriori). The Gibbs sampler iteratively samples from the high-dimensional distribution P(π′, {Y′_j} | {X′_j}). Note that for a Dirichlet prior we have

P(π′ | {Y′_j = y_j, X′_j}) = Dirichlet(α₁ + Σ_{j=1}^{N′} 1[y_j = 1], ..., α_L + Σ_{j=1}^{N′} 1[y_j = L]).

The assignments of individual points are then sequentially sampled as

Y′_k ∼ P(Y′_k | {Y′_1, ..., Y′_{k−1}, Y′_{k+1}, ..., Y′_{N′}}, {X′_j}, π′),

which is possible due to the equality P(Y′_k | X′_k, π′) = Categorical(ξ_{k1}, ..., ξ_{kL}), where ξ_{ky} = P(Y′_k = y | X′_k = x′_k, π′), obtained using the probabilistic classifier r similarly as above.

D.2 Estimators employing auxiliary black-box classifiers

In this section we briefly review the main algorithms employing a black-box classifier.

D.2.1 Classify and count

When C = Y and f: X → Y is a classifier trained for a given problem with good accuracy, the simplest approach is to count its predictions and normalize by the total number of examples in the unlabeled data set. However, as Tasche (2017) shows, this approach need not correctly estimate P(Y) even in the limit of infinite data.

D.2.2 Adjusted classify and count

Consider the case of an imperfect binary classifier, with Y = C = {+, −}. The true and false positive rates are defined by

TPR = P(C = + | Y = +), FPR = P(C = + | Y = −),

and can be estimated using the labeled data set. If θ = Ptest(Y = +), we have

Ptest(C = +) = TPR·θ + FPR·(1 − θ),

where Ptest(C = +) can be estimated by applying the classifier to the unlabeled data set and counting the positive outputs.
If we assume that TPR ≠ FPR, i.e., that the classifier has any predictive power, we obtain

θ = (Ptest(C = +) − FPR) / (TPR − FPR).

Then, Ptest(C = +) is estimated by counting the predictions of the classifier on the unlabeled data set. As Tasche (2017) showed, this estimator is consistent in the limit of infinite data. Two generalizations, extending it to problems with more classes, are known as the invariant ratio estimator and the black-box shift estimator.

D.2.3 Invariant ratio estimator

Vaz et al. (2019) introduce the invariant ratio estimator, generalizing the adjusted classify and count approach as well as its soft version proposed by Bella et al. (2010). Consider any function f: X → ℝ^{L−1}. For example, if u: X → Y is a classifier predicting outputs in the set {1, ..., L}, we may define f as the one-hot encoding of the first L − 1 labels, assigning the zero vector to the last label:

f(x) = (1, 0, ..., 0) if u(x) = 1,
f(x) = (0, 1, ..., 0) if u(x) = 2,
...
f(x) = (0, 0, ..., 1) if u(x) = L − 1,
f(x) = (0, 0, ..., 0) if u(x) = L.

Analogously, for a soft classifier u: X → ∆^{L−1} ⊆ ℝ^L, f may be defined as f_k(x) = u_k(x) for k ∈ {1, ..., L − 1}.

Then the unrestricted estimator π̂′_U ∈ ℝ^L is given by solving the linear system

F̂₁₁ π′₁ + ⋯ + F̂₁ₗ π′_L = f̂₁
...
F̂_{L−1,1} π′₁ + ⋯ + F̂_{L−1,L} π′_L = f̂_{L−1}
π′₁ + ⋯ + π′_L = 1,

where f̂_k = (1/N′) Σ_{j=1}^{N′} f_k(x′_j) and F̂_{kl} = (1/|S_l|) Σ_{x ∈ S_l} f_k(x), with S_l being the subset of the labeled data set with y_i = l. Note that adjusted classify and count is a special case of the invariant ratio estimator for a hard classifier. Similarly, the algorithm proposed by Bella et al. (2010) is a special case of the invariant ratio estimator for a soft classifier. The generalization to K ≠ L is immediate, with F̂ becoming a (K − 1) × L matrix and f̂ becoming a vector of dimension K − 1. Finally, Vaz et al. (2019) introduce a restricted estimator π̂′_R ∈ ∆^{L−1}, which is given by a projection of π̂′_U onto the probability simplex.
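In the binary case, the adjusted classify and count correction described above takes only a few lines. A hedged sketch (the function name is ours; clipping the estimate to [0, 1] plays the role of the restricted estimator's simplex projection in one dimension):

```python
import numpy as np

def adjusted_classify_and_count(y_labeled, c_labeled, c_unlabeled):
    """Binary ACC sketch. Labels and predictions are 0/1 arrays,
    with 1 denoting the positive class."""
    tpr = c_labeled[y_labeled == 1].mean()  # estimate of P(C=+ | Y=+)
    fpr = c_labeled[y_labeled == 0].mean()  # estimate of P(C=+ | Y=-)
    p_plus = c_unlabeled.mean()             # estimate of P_test(C=+)
    if tpr == fpr:
        raise ValueError("classifier has no predictive power (TPR == FPR)")
    theta = (p_plus - fpr) / (tpr - fpr)
    # Clip to [0, 1]: the raw estimate can fall outside the probability range.
    return float(np.clip(theta, 0.0, 1.0))
```

For instance, with TPR = 0.9, FPR = 0.1, and half of the unlabeled predictions positive, the estimate is (0.5 − 0.1)/(0.9 − 0.1) = 0.5.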
In our implementation we use the projection-via-sorting algorithm (Shalev-Shwartz & Singer, 2006; Blondel et al., 2014).

D.2.4 Black-box shift estimator

Black-box shift estimators are also based on the observation that Ptest(C) = P(C | Y) Ptest(Y), where the P(C | Y) matrix can be estimated using the labeled data set. Instead of solving this matrix equation directly by finding the (left) inverse, Lipton et al. (2018) estimate the pointwise ratio R(Y) = Ptest(Y)/Ptrain(Y) by rewriting the equation as

Ptest(C) = Ptrain(C, Y) R(Y),

and estimating the joint probability matrix Ptrain(C, Y) using the labeled data set. Then, the equation can be solved for R(Y). By pointwise multiplication by Ptrain(Y) (estimated using the labeled data set) the prevalence vector Ptest(Y) is found. Note that this approach naturally generalizes to the K ≠ L case. Lipton et al. (2018) study the case K = L and derive asymptotic error bounds. More recently, Azizzadenesheli et al. (2019) introduced a regularized variant of this approach.

D.2.5 Unsupervised recalibration

Ziegler & Czyż (2020) study the quantification problem from the perspective of recalibration of a given probabilistic classifier. Their method can be interpreted partly as a black-box shift estimator and partly as a likelihood-based estimator. Namely, they propose to use a black-box classifier to predict the labels C_i = f(X_i) and C′_j = f(X′_j) and to estimate the probability table P(C | Y) using the plug-in estimator. However, they note that solving Eq. 1 explicitly may suffer from numerical issues when the condition number is high, and instead they optimize the multinomial likelihood of the observed counts C′_j.

D.3 Other algorithms

Other quantification methods include the CDE-Iterate algorithm of Xue & Weiss (2009), which can obtain good empirical performance on selected problems (Karpov et al., 2016). However, as Tasche (2017, Sec.
3.4) showed, it is not asymptotically consistent. Zhang et al. (2013) describe a kernel mean matching approach with a provable theoretical guarantee. However, as Lipton et al. (2018, Sec. 6) observed, kernel-based methods may be challenging to scale to large data sets. Finally, Moreo et al. (2021) present a Python package for quantification problems.

E Experimental details and additional experiments

In this section we provide additional details on the experimental protocols used in Sec. 4, together with additional experimental results. Experiments described in Appendices E.1, E.4, and E.5 were run on a laptop with 32 GiB RAM and 16 CPU cores clocked at 4680 MHz and finished under six hours. Experiments described in Appendices E.2 and E.3 require runs over many random seeds and therefore larger computing power (unless the number of random seeds is reduced). We ran them sequentially on a cluster equipped with 384 GiB RAM and 128 CPU cores clocked at 2.25–3.7 GHz. As the cluster is shared between different researchers and uses the Slurm workload manager (Yoo et al., 2003) to distribute runs among different projects, this provides an upper bound on the computational resources used. These experiments finished in five hours.

E.1 Nearly non-identifiable model

We reproduced the experiment described in Sec. 4.1 S = 5 times, varying the random seed to obtain different data samples (and, subsequently, different posterior and bootstrap samples) as well as the data set size under the N = N′ constraint. We noted that methods based on matrix inversion raised an error whenever a matrix estimated from the bootstrap sample was singular. We dropped such bootstrap samples. In Fig. 6 we use N = N′ = 100, in Fig. 7 we use N = N′ = 10³, and in Fig. 8 we use N = N′ = 10⁴. We generally see that the bootstrap for N = N′ ∈ {10², 10³} can result in negative prevalence estimates. On the other hand, the restricted invariant ratio estimator (RIR) often does not appropriately estimate the first component.
Hence, we consider the Bayesian posterior preferable in low-data settings. For N = N′ = 10⁴ we see that the performance of all methods is comparable. For each simulated data set, we ran four Markov chains with 500 warm-up steps and 1,000 samples each using the NUTS algorithm of Hoffman & Gelman (2014). To flag potential convergence issues, we calculated the potential scale reduction factor R̂ (Gelman et al., 2013, Sec. 11.4). For each variable it did not exceed 1.01. The effective sample size was over 3,000 for all parameters and across all experimental settings.

E.2 Discrete categorical model benchmark

The default parameters introduced in Sec. 4.2 are gathered in Table 1. For each data set we used four Markov chains with 500 warm-up steps and 1,000 samples collected per chain. The maximal R̂ did not exceed 1.01 and the minimal effective sample size was 772. Fig. 9 presents the outcomes of the experiment in which the mean absolute error metric is replaced with the root mean squared error:

RMSE(π̂′, π′) = √( (1/L) Σ_y (π̂′_y − π′_y)² ).

Qualitatively, the conclusions do not change.

E.3 Misspecified model

We repeated the experiment described in Sec. 4.3 for N = N′ ∈ {10², 10³, 10⁴} samples, with the results presented in Fig. 10.
For each generated data set and each method we ran four Markov chains with 1,500 warm-up steps and 2,000 samples collected per chain.

Figure 6: For N = N′ = 100 samples the posterior is not very precise around the first component. We note that for the unrestricted estimators the bootstrap samples result in negative probability estimates. (Panels: posterior samples and bootstrapped RIR, UIR, and BBS estimates over classes 1–3.)

Figure 7: For N = N′ = 10³ the Bayesian posterior concentrates around the ground-truth value of the first component. However, bootstrap samples often yield negative probability estimates. (Panels as in Fig. 6.)

Figure 8: For N = N′ = 10⁴ samples the first component is perfectly determined. Bootstrap samples capture the uncertainty well and are qualitatively similar to the samples from the Bayesian posterior. (Panels as in Fig. 6.)

Figure 9: Quantification using simulated categorical black-box classifiers under different scenarios (unlabeled sample size N′, classifier quality q, number of classifier outputs K, number of labels L = K, and misspecified quality q; methods: BAY, BBS, CC, RIR).

Figure 10: Experiments with a mixture of Student distributions. First column: conditional Student distributions P(X | Y). Second column: train (labeled) and test (unlabeled) distributions. Third column: posterior according to the different models (Gaussian, Student, Discrete with 5 bins, Discrete with 10 bins). Fourth column: coverage of high-density credible intervals measured over S = 200 simulations. Top row: N = N′ = 100 samples. Middle row: N = N′ = 10³ samples. Bottom row: N = N′ = 10⁴ samples.

We noticed occasional convergence issues: runs with at least one parameter such that R̂ ≥ 1.02 were excluded from the coverage calculation (see Table 2). The convergence for all models is satisfactory (with the minimal number of simulations retained being 157 out of the conducted 200), although we noticed that more runs were excluded for larger sample sizes and that misspecification (i.e., the Gaussian mixture model) resulted in more runs being excluded. We see that the coverage of the high-density credible intervals of the discrete model generally agrees with the nominal value. However, the posterior in the discrete model can be wider than in the properly specified model using the generative mechanisms P(X | Y). Moreover, we see that using a discretized quantification method may be preferable over using a misspecified model when a large sample size is used: although for N = N′ = 100 the (misspecified) Gaussian mixture model has coverage close to the nominal value, for N = N′ ∈ {10³, 10⁴} the coverage deteriorates quickly.

Parameters of the ground-truth mixture model We used (X | Y = 1) ∼ T(0, 0.5², 3) and (X | Y = 2) ∼ T(1, 0.5², 4), where T(µ, σ², ν) is the location-scale t-distribution, i.e., the pushforward of the standard Student t-distribution T(0, 1, ν) with ν degrees of freedom by the affine mapping x ↦ µ + σx. We sampled the number of labeled samples with label Y = 1 using N₁ ∼ Binomial(N, 0.5) and then defined the number of samples with label Y = 2 by N − N₁.
Similarly, we generated an unlabeled data set, but with class prevalence 0.2, rather than 0.5.

Priors on the Gaussian and Student mixture models In both cases we used the uniform prior, π′ ∼ Dirichlet(1, 1), on the prevalence vector. We modeled the scale parameters via the half-Cauchy prior, σᵢ ∼ C⁺(1), and the location parameters via µᵢ ∼ N(0, 3²). Additionally, the Student mixture had a positive prior on the degrees of freedom, νᵢ ∼ Gamma(1, 1). The Gaussian mixture model can be treated as a special case of this model with the constraint νᵢ = ∞ for both components.

E.4 Single-cell data analysis

We downloaded the TPM-normalized (Zhao et al., 2021) data sequenced by Darmanis et al. (2017) from the Curated Cancer Cell Atlas. We applied the x ↦ log(1 + x) transform to all entries. In Fig. 11 we visualize P(X | Y) by projecting the gene expression X onto the first four principal components (calculated using all samples pooled together). We see that the distribution P(X | Y) differs between the samples, although the cell types seem to roughly cluster together, so a random forest classifier may distinguish well between the different subtypes. As the random forest we used the scikit-learn implementation (Pedregosa et al., 2011, v. 1.4.1) with default hyperparameters and 20 trees. Before training the random forest we reduced the dimensionality by projecting the training data onto the first 50 principal components. This projection (onto the components defined by the training data) is used for making predictions on the other samples, before the random forest is applied. As in App. E.1, in this experiment we ran four chains with 500 warm-up steps and 1,000 samples each. To flag potential convergence issues, we calculated the potential scale reduction factors R̂ to two decimal places; they did not exceed 1.02. The minimal corresponding effective sample size was 298.

E.5 Sensitivity to the prior choice

In Sec.
4 we used uniform priors over all simplices, which are supposed to act as a reference prior (Gelman et al., 2013, Sec. 2.8) for problems without precise domain-specific information. In practice, the inference may depend on the choice of the prior, and a sensitivity analysis should be performed (see Kruschke (2021) or Gelman et al. (2020, Sec. 6.3)).

Table 1: Default parameters used in the experiments.
N = 1000, N′ = 500, r = 0.7, q = 0.85, L = 5, K = 5.

Table 2: Number of simulations (out of 200) with R̂ < 1.02 for each sample size N = N′ and model.
Sample size | Discrete (10 bins) | Discrete (5 bins) | Gaussian | Student
100         | 200                | 200               | 192      | 200
1000        | 199                | 199               | 174      | 200
10000       | 199                | 196               | 157      | 197

Figure 11: Projections onto the first four principal components of the whole data set. Each column describes the projections of a specified sample onto the 1st and 2nd (first row) or the 3rd and 4th (second row) components, coloured by the cell type. We see distributional differences and conclude that P(X | Y) is not invariant between samples.

We illustrate the prior choice problem in an example with L = K = 2 classes, an imperfect classifier with P(C = 1 | Y = 1) = P(C = 2 | Y = 2) = 0.7, and prevalence vectors π = (0.5, 0.5) and π′ = (0.7, 0.3). We sampled three data sets, differing by the number of points, N = N′ ∈ {50, 500, 5000}, and we fitted three models, differing by the choice of the prior. We used the symmetric Dirichlet priors Dirichlet(α, α, ..., α) over all simplices. While a single α simultaneously controls the priors on the φ matrix and on the prevalence vectors, a choice which may often not be desired in practice, we hope that it provides meaningful information on the sensitivity of the inference to the choice of the prior. For α = 1 this prior is uniform and has been studied in Sec. 4. Choosing α < 1 encourages more sparsity: the prior on the π′ vector concentrates near the boundary of the simplex (i.e., near the (1, 0) and (0, 1) vectors).
Simultaneously, the model assumes that the φ matrix corresponds to sharper predictions. For example, this means that one of the entries {P(C = 1 | Y = 1), P(C = 2 | Y = 1)} should be larger than the other, i.e., the true positive rate and the false negative rate should be very different. For α > 1 the prior distributes the mass around the center, preferring balanced prevalence vectors π′. Under this prior, the false negative rate and the true positive rate are similar. We varied α ∈ {0.1, 1, 10} and performed Bayesian inference. As in Appendices E.1 and E.4, we ran four Markov chains with 500 warm-up steps and 1,000 samples each. To flag potential convergence issues, we calculated the potential scale reduction factors R̂, which did not exceed 1.02. The minimal effective sample size was 362. The posteriors of the different models are visualised in Fig. 12. We see that for a small sample size (N = N′ = 50), the sparse prior (α = 0.1) puts more mass at the boundary. At the same time, the predictions under the α = 1 and α = 10 priors are nearly identical. For larger N = N′ the effect of the prior vanishes, with the posterior concentrating around the true value, which is in line with Theorem 3.1.

Figure 12: Sensitivity of the Bayesian inference with respect to the Dirichlet prior with concentration parameter α ∈ {0.1, 1.0, 10.0} under different sample sizes N = N′ ∈ {50, 500, 5000}. We plot the posterior in the form of a histogram, together with a vertical line representing the posterior mean. The dashed line represents the ground-truth value.
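The qualitative effect of the concentration parameter α described above can be illustrated directly by sampling from the prior. A minimal sketch (the parameter values and the distance-from-center summary are our own illustrative choices):

```python
import numpy as np

# Draws from a symmetric Dirichlet(alpha, alpha) prior concentrate near the
# simplex boundary for alpha < 1 and near the center (0.5, 0.5) for alpha > 1.
rng = np.random.default_rng(42)

def mean_distance_from_center(alpha, n_draws=20_000):
    # Average |pi'_1 - 0.5| of the first component under the prior.
    draws = rng.dirichlet([alpha, alpha], size=n_draws)[:, 0]
    return np.abs(draws - 0.5).mean()

sparse, uniform, balanced = (mean_distance_from_center(a) for a in (0.1, 1.0, 10.0))
# Sparser priors put mass further from the center of the simplex.
assert sparse > uniform > balanced
```

For α = 1 the first component is uniform on [0, 1], so the summary is close to 0.25; the sparse prior pushes it towards 0.5 and the balanced prior towards 0.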