# predictive_inference_with_weak_supervision__bf111ad9.pdf

Journal of Machine Learning Research 25 (2024) 1-46 Submitted 3/23; Revised 5/24; Published 5/24

Predictive Inference with Weak Supervision

Maxime Cauchois maxime.cauchois@gmail.com Department of Statistics Stanford University Stanford, CA 94305-4020, USA

Suyash Gupta suyash028@gmail.com Department of Statistics Stanford University Stanford, CA 94305-4020, USA

Alnur Ali alnurali@gmail.com Department of Statistics Stanford University Stanford, CA 94305-4020, USA

John Duchi jduchi@stanford.edu Department of Statistics and electrical Engineering Stanford University Stanford, CA 94305-4020, USA

Editor: Daniel Hsu

The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model ﬁtting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive conﬁdence sets sets that cover a true label with a prescribed probability, independent of the underlying distribution using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing eﬃcient algorithms for classiﬁcation and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) conﬁdence sets through several experiments. Keywords: Conformal inference, Conﬁdence sets, Coverage validity, Weak supervision, Partial labels

1 Introduction

Consider the typical supervised learning pipeline that we teach students in statistical machine learning: we collect data in (X, Y ) pairs, where Y is a label or target to be predicted; we pick a model and loss measuring the ﬁdelity of the model to observed data; we choose the model minimizing the loss and validate it on held-out data. This picture obscures what is becoming one of the major challenges in this endeavor: that of actually collecting highquality labeled data (Sculley et al., 2015; Donoho, 2017; Ratner et al., 2017; Gadre et al.,

2024 Maxime Cauchois, Suyash Gupta, Alnur Ali and John Duchi.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0253.html.

Cauchois, Gupta, Ali, and Duchi

2023). Hand labeling large-scale training sets is often impractically expensive. Consider, as simple motivation, a ranking problem: a prediction is an ordered list of a set of items, yet available feedback is likely to be incomplete and partial, such as a top element (for example, in web search a user clicks on a single preferred link, or in a grocery, an individual buys one kind of milk but provides no feedback on the other brands present). Developing methods to leverage such partial and weak feedback is therefore becoming a major focus, and researchers have developed methods to transform weak and noisy labels into a dataset with strong, gold-standard labels (Ratner et al., 2017; Zhang et al., 2017). In this paper, we adopt this weakly labeled setting, but instead of considering model ﬁtting and the construction of strong labels, we focus on validation, model conﬁdence, and predictive inference, moving beyond point predictions and single labels. Our goal is to develop methods to rigorously quantify the conﬁdence a practitioner should have in a model given only weak labels. First consider the standard supervised learning scenario for data (X, Y ) X Y: here, given a desired conﬁdence level α, the goal, rather than to provide point estimates b Y of Y given X, is to give a conﬁdence set mapping b Cn based on (Xi, Yi)n i=1 that guarantees the distribution-free coverage

P h Yn+1 b Cn(Xn+1) i 1 α, (1)

where (Xn+1, Yn+1) is a new observation following the same distribution as the ﬁrst n points. Conformal inference provides precisely these guarantees (Vovk et al., 2005; Lei, 2014; Lei and Wasserman, 2014; Lei et al., 2018; Barber et al., 2021). There are many scenarios, however, where it is natural to transition away from this strongly supervised setting with fully labeled examples. Above we note ranking: individuals are very unlikely to provide full feedback (Ailon et al., 2008; Duchi et al., 2013; Negahban et al., 2016). In multi-label image classiﬁcation (Boutell et al., 2004; Elisseeﬀand Weston, 2001), a labeler may identify a few items in a given scene but not all, leading to partial labeled feedback. A major challenge in industrial machine learning deployment is to monitor models once they are in production, where it may be challenging to collect high-quality labels, but weak supervision in the form of clicks on a recommended website, or agreeing to a suggested text message completion is relatively easy and cheap to collect. In all of these, developing valid conﬁdence sets and measures for our predictions is of growing importance, as we wish for models to be trustable, usable, and veriﬁable. With this as motivation, we consider supervised learning problems where, instead of directly observing the ground truth labels {Yi}n i=1, we observe only noisy partial labeling. We make this formal in two equivalent ways. In the ﬁrst, for each instance i [n], there exists a (random) function ϕi : Y Yweak belonging to a set Φ {Y Yweak} of partially supervising (or measurement) functions, such that we only observe Y weak i = ϕi(Yi). Equivalently, the pair (Y weak i , ϕi) speciﬁes a weak set Wi Y that contains Yi:

Wi := n y Y | ϕi(y) = Y weak i o Y, (2)

so that we observe a set Wi consistent with Yi. Instead of strong labels (Xi, Yi) iid P, we thus observe only (Xi, ϕi, Y weak i )n i=1. Two running examples with ranking and multi-label problems will help to illustrate our setting.

Predictive Inference with Weak Supervision

Example 1 (Ranking): In a ranking problem, the goal is to rank elements of a set of K items, indentiﬁed via [K] = {1, . . . , K}, where strong labels y SK, the set of permutations of K elements. The strong label y then speciﬁes item at each rank j, so that y(1) [K] is ranked ﬁrst. Two natural forms of weak labeling include

Top-1 feedback, where a response consists of the item ranked ﬁrst; when item j has the ﬁrst rank, this corresponds to the set W = {y SK | y(1) = j} of permutations with j in the ﬁrst position, so that Y weak = ϕ(Y ) = Y (1), and card(W) = (K 1)!

Pairwise comparison feedback, so that for a pair of items j1, j2 [K] speciﬁed in ϕ, ϕ(y) = 1 y 1(j1) < y 1(j2) , indicating whether the order y ranks j1 ahead of j2; the set of weak labels W = {y | y 1(j1) < y 1(j2)} thus satisﬁes and card(W) = k 1 2 (k 2)! = (k 1)(k 1)!

Note the duality between the pairs (Y weak, ϕ) and W; working with one or the other is frequently more convenient. 3

Example 2 (Multilabel object recognition): In a multilabel object recognition problem, there are K objects of interest, and on an input image x, the strong label y {0, 1}K

indicates which objects appear in the image. A labeler may choose (or recognize) only a subset I [K] of the objects, so that the (random) measurement ϕ(y) {0, 1}K satisﬁes ϕ(y)j = yj if j I and ϕj(y) = 0 otherwise, that is, Y weak j = Yj if j I and Y weak j = 0 otherwise. In this case, we may represent W as the set of elementwise larger vectors W = {y {0, 1}K | yj Yj for j I}, which has cardinality card(W) = 2K |I|. 3

Throughout, ϕi is a random preference function describing the parts of the ground truth label we observe in the partially labeled dataset. It also captures the information about the ground truth label that matter at test time. In the context of ranking (Ex. 1), this means that an individual cares only that their top-ranked item is ﬁrst, or that the ranking orders a particular subset of items correctly. Our key assumption is that the partial feedback acquisition distribution and distribution of future measurement functions ϕ coincide, so that providing a label y Y that maps to the weak label, i.e., ϕn+1(y) = Y weak n+1 , is correct. The ranking example 1 makes clear that this assumption is plausible, as an individual presumably is more likely to both provide feedback and care about the elements at the top of their rankings; other domains are similar. Finally, without loss of generality, one can always assume that Yweak = 2Y, as any preference function implicitly maps each element y Y to a subset of Y containing y; see equation (2). We consider two fundamental questions in this weakly-labeled setting:

(Q.i) When is it possible to provide (distribution-free) coverage, for example, using true labels Y , weak Y weak or sets W, or other measurements?

(Q.ii) What methods can guarantee coverage?

While a ﬁrst goal would be to produce a conﬁdence mapping using (Xi, Y weak i , ϕi)n i=1 guaranteeing the coverage (1), as we prove in Section 2.1, this would in general produce large and therefore uninformative conﬁdence sets. We therefore relax our coverage desiderata, instead

Cauchois, Gupta, Ali, and Duchi

seeking a conﬁdence set b Cn : X Y that covers the weak counterpart Y weak n+1 = ϕn+1(Yn+1) of the true label in the sense that

P h b Cn(Xn+1) Wn+1 = i = P h y b Cn(Xn+1) s.t. ϕn+1(y) = Y weak n+1 i 1 α. (3)

The direct application of conventional conformal inference methods (Vovk et al., 2005; Lei, 2014; Barber et al., 2021) to obtain this weak coverage is practically impossible, as the space of weak labels is a priori unknown, is typically quite large. Additionally, feedback in the form of collections of weak label sets is typically unusable; as in Example 1, we wish to provide labels and conﬁgurations in the target space Y directly. The condition (3) is weaker than the standard coverage (1), and any conﬁdence set satisfying (1) the former will also satisfy (3), though it allows smaller conﬁdence set sizes. A major challenge is that the function ϕn+1 representing an individual s weak supervision is a priori unknown (e.g., in Example 2 the items in an image a labeler will identify ahead of time). Indeed, if we observe ϕn+1 prior to our prediction, a trivial extension of classical conformal methodology (Vovk et al., 2005; Lei, 2014; Barber et al., 2021) achieves coverage (3): ﬁrst, construct a valid conﬁdence set mapping b Cn,weak : X Yweak for Y weak n+1 , which as in (1) would guarantee P(Y weak n+1 b Cn,weak(Xn+1)) 1 α. Then deﬁne b Cn(x) Y to include a single y Y for each yweak b Cn,weak(x) such that ϕn+1(y) = yweak. Example 1 shows the impossibility of such an approach: we do not know ahead of time if an individual cares only about the top-ranked item or requires a ranking accurate to the 10th item. Given the subtleties of coverage (3), we dedicate Section 2 to question (Q.i) above: what types of coverage are even possible? We devote Sections 3 and 4 to question (Q.ii): the development of methodologies that can guarantee the coverage (3). We ﬁrst (Sec. 3) provide a general recipe, while in Section 4 we provide more tailored methods for large output spaces, such as those in structured prediction. To provide some initial insights into the methods and potential applications, we provide experiments on several real-world domains; in the main body (Section 5) we investigate ranking, while the appendices (see Appendix C) provide additional examples with structured prediction, matching for pedestrian tracking in videos, and prediction intervals for county-level voting in the United States.

1.1 Related Work

An extensive line of work addresses prediction with partially labeled data. The major focus is on strong label recovery under weak supervision, including in multiclass (Cour et al., 2011; Nguyen and Caruana, 2008) and multilabel (Yu et al., 2014) tasks as well as structured prediction problems, such as ranking (H ullermeier et al., 2008; Korba et al., 2018), segmentation (Triggs and Verbeek, 2008; Papandreou et al., 2015), and natural language processing (Fernandes and Brefeld, 2011; Mayhew et al., 2019). More recent work tackles constructing strongly labeled datasets from disparate weak supervision tasks (Ratner et al., 2017; Zhang et al., 2017), while Cid-Sueiro et al. (2014), van Rooyen and Williamson (2018), and Cabannes et al. (2020) provide generic theoretical conditions allowing strong label recovery. Yet this literature focuses primarily on point prediction problems, where a model only returns a single label with the (putative) highest likelihood, in contrast to our conﬁdence-based approach, which provides calibrated uncertainty estimates and guarantees valid conﬁdence sets with virtually no distributional assumptions.

Predictive Inference with Weak Supervision

Our work also connects to the substantial literature on conformal inference, where the goal is to provide valid predictive conﬁdence sets (1). Vovk et al. (2005) introduce the main techniques that examples are exchangeable, and so essentially can provide p-values for signiﬁcance of one-another and suggest the simple and generic split-conformal algorithm for building valid conﬁdence sets. Essentially all conformalized conﬁdence sets oﬀer the coverage guarantee (1), so it is of interest to improve various aspects of the mappings b Cn. For example, works focus on improving the precision of these methods and optimizing average conﬁdence set size (Lei et al., 2018; Sadinle et al., 2019; Hechtlinger et al., 2019; Romano et al., 2019a; Angelopoulos et al., 2020), or on bridging the gap with other forms of coverage, like classwise (Sadinle et al., 2019) or conditional (Romano et al., 2019b; Barber et al., 2021; Cauchois et al., 2021; Romano et al., 2020; Cauchois et al., 2020) coverage. Along these lines, Bates et al. (2021) generalize conformal inference to oﬀer error control with respect to loss functions beyond the 0-1 loss (coverage or non-coverage) central to the guarantee (1), taking, as we do, structured prediction problems as motivation. Bates et al. focus on settings where the loss function naturally reﬂects the structure of the label space Y, such as hierarchical classiﬁcation problems where one wishes to label an example X at a resolution (level of the tree) appropriate to the conﬁdence with which it can be labeled. We view our approaches as complementary to theirs: their approaches make sense for scenarios with fully labeled data in which a particular loss function is natural, for example in treestructured hierarchical classiﬁcation, where a prediction can be made at a given level in the tree. Conversely, our approaches are sensible when one receives weakly supervised data and wishes to make a single good prediction; think of a grocery store deciding which of a large collection of shaving creams to stock, a ranking problem where one wishes to make sure that each individual s desired shaving cream is stocked; in the context of Example 1, Y weak is then the preferred shaving cream (top-1), and the guarantee (3) corresponds to top-1 coverage. In that respect, our approach relates to the expanded admission problem (Fisch et al., 2021), which allows for multiple labels to be admissible , except that we do not observe strongly supervised labels. Consequently, we motivate our distinct coverage guarantees from a set of impossibility results we present in the next section. Additionally, we pay special attention (see Sec. 4) to developing practical algorithms that scale to large label spaces, an important consideration with real-world weak supervision.

Notation Throughout this paper, [n] stands for the set {1, 2, . . . , n}. We use C : X Y to denote a set valued mapping C : X 2Y := {W | W Y}. P is either the probability distribution generating the data (X, Y, ϕ) X Y Φ, or equivalently (X, Y, W) X Y 2Y, as both notations are equivalent for our purpose, and U Uni[0, 1] deﬁnes a uniform random variable on [0, 1]. S(U, V ) is the set of bijections between two sets U and V , and we use the shorthand SK := S([K], [K]) for permutations; (i, j) is the transposition of elements i and j [K], and for k N, k := {p Rk + | p T 1 = 1} is the space of probability distributions on [k].

2 Conformal inference with weakly supervised data

The starting point of this paper is to delineate realistic goals in weak-conformal inference by determining what is actually possible as we show, a form of weak coverage and what is unachievable. To that end, we demonstrate that strong coverage (1), while desirable,

Cauchois, Gupta, Ali, and Duchi

may yield large and diﬃcult to interpret prediction sets. For example, as a consequence of Corollary 4 to come, in the ranking example 1, if feedback always consists of a paired comparison (e.g., item j1 preferred to item j2), then strong coverage necessitates a prediction set of size at least order k! (k

e)k. We thus relax our goals, presenting a general weak conformal scheme (Section 2.2) that relies on weakly supervised data.

2.1 The strong coverage dilemma with partially supervised data

Consider a fully supervised classiﬁcation setting with feature space X and output space Y, and let Pstrong be a joint distribution on X Y representing strong, as opposed to weak,

supervision. In this fully supervised setting, we observe (Xi, Yi) iid Pstrong, in contrast to observing a weak label set W Y satisfying only Y W. We ﬁrst require deﬁnitions of consistency and validity.

Deﬁnition 1 A probability distribution P on (X, Y, W) X Y 2Y is consistent if P(Y W) = 1. For any consistent distribution P, Pweak and Pstrong denote the marginal distributions of (X, W) and (X, Y ), respectively, when (X, Y, W) P.

Deﬁnition 2 Let b Cn : X Y be a (potentially randomized) procedure depending only on the weakly supervised sample (Xi, Wi)n i=1 X 2Y. Then b Cn provides (1 α)-strong distri-

bution free coverage if for all consistent distributions P on X Y 2Y and (Xi, Yi, Wi)n+1 i=1 iid P, we have coverage (1), i.e.,

P h Yn+1 b Cn(Xn+1) i 1 α,

and b Cn provides (1 α)-weak distribution free coverage (3) if

P h Wn+1 b Cn(Xn+1) = i 1 α.

With these deﬁnitions, we can provide the (negative) result that, on average over the data set, any procedure satisfying strong distribution free coverage (1) must include every individual label y Wn+1 with probability at least 1 α. To formalize this, for a conﬁdence set mapping b Cn : X Y constructed with (Xi, Wi)n i=1, deﬁne the function

pn(x, y) := P y b Cn(x) ,

which is the probability, taken over the weakly supervised sample (Xi, Wi)n i=1, that b Cn(x) contains the potential label y. We prove the following theorem in Appendix B.1.1.

Theorem 3 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage. Then for all consistent distributions P on X Y 2Y,

E{(Xi,Wi)}n+1 i=1 iid Pweak

inf y Wn+1 pn(Xn+1, y) 1 α.

Predictive Inference with Weak Supervision

Theorem 3 essentially states that b Cn simultaneously includes each element y Wn+1 with probability at least 1 α. The theorem is generally not improvable, as Wn+1 need not be a subset of b Cn(Xn+1). Indeed, think of the trivial procedure b Cn that includes every label y Y independently with probability 1 α: it obviously satisﬁes strong distribution-free coverage but has no connection with Wn+1. As an additional immediate corollary, if the sets W contain at least a ﬁxed number of labels, then so does b Cn(Xn+1).

Corollary 4 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage, and that P(|W| L) = 1 for some L 1. Then

E{(Xi,Wi)}n+1 i=1 iid Pweak

h b Cn(Xn+1) i L(1 α).

Proof By Theorem 3,

E h b Cn(Xn+1) i = E

y Y pn(Xn+1, y)

E |Wn+1| inf y Wn+1 pn(Xn+1, y) L(1 α)

as claimed.

Recalling Example 1, top-item feedback necessitates (k 1)! sets for ranking; multi-label recognition (Ex. 2) similarly necessitates an exponentially large set b Cn. An alternative perspective is to consider large-sample limits; often, the procedure b Cn converges to some population conﬁdence set mapping C : X Y as n , in that

E h b Cn(X) C(X) i 0 (4)

as n , where the expectation is over both the construction of b Cn and X independent of (Xi, Wi) iid Pweak. Typically, the limiting C is a (nearly) deterministic function1 of x; for example, the standard construction (e.g. Vovk et al., 2005; Lei, 2014; Barber et al., 2021) takes C(x) = {y Y | s(x, y) τ} for some scoring function s : X Y R and threshold τ, which is deterministic. In this case, we can show that we nearly have W C(X), so C(X) must be large whenever W is. To formalize, let

Det C(x) := {y Y | P(y C(x)) {0, 1}}

be the labels that are deterministically in or out of C(x) (where the probability is over any randomization in the mapping C) so that Det C(x) = Y whenever C is deterministic. Then can show that W C(X) with probability at least 1 α:

Corollary 5 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage and satisﬁes the limit (4). Then

P(W Det C(X) C(X)) 1 α.

1. In some cases, we use randomization over a single label to guarantee that P(Y C(X)) = 1 α

Cauchois, Gupta, Ali, and Duchi

Appendix B.1.2 proves a slightly stronger result (which we state as Corollary 8). Theorem 3 and its corollaries suggest that any procedure achieving strong (distributionfree) coverage necessarily produces ineﬃcient (large) conﬁdence sets when one uses only weakly supervised data. Even in cases where there is implicitly a single correct label, such as the structured prediction problems Cabannes et al. (2020) consider, where the weak labels w that a single x supports (those for which P(W = w | X = x) > 0) have a single label y in their intersection w:P(w|x)>0{w} = {y}, large weak sets W remain possible. We thus must take a diﬀerent tack, targeting new coverage desiderata.

An aside: regression. Our development applies to regression or other problems with continuous or inﬁnite response sets, e.g., Y = R, as nothing in Theorem 3 or Theorem 9 to come requires Y to be any particular space. We leverage this in our experiments (Sec. 5 and Appendix C) to give numerical examples, touching on the R-valued case here to demonstrate the analogues of our theoretical results. As an example, weak sets W in the continuous case may be intervals, arising, for example, from measurements with limited resolution. We adapt Corollary 4 to regression by replacing counting measure with the Lebesgue measure Leb, where the response set Y = R and the weak sets W R. Assuming that the weak sets all have a minimal volume, any valid conﬁdence set mapping necessarily is large (on average) as well:

Corollary 6 Suppose that b Cn : X Y = R provides (1 α)-strong distribution free coverage, and let L > 0. If P(Leb(Wi) L) = 1, for i = 1, . . . , n + 1, then

E h Leb b Cn(Xn+1) i L(1 α).

This follows because any measure on Y gives an analogous result:

Corollary 7 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage, let L > 0, and µ a measure on Y. If P(µ(Wi) L) = 1 for i = 1, . . . , n + 1, then

E h µ b Cn(Xn+1) i L(1 α).

Proof By Fubini s theorem and Theorem 3,

E h µ b Cn(Xn+1) i = E Z

Y 1 n y b Cn(Xn+1) o dµ(y)

Y p(Xn+1, y)dµ(y) E inf y Wn+1 p(Xn+1, y)µ(Wn+1) L(1 α),

as claimed.

The extension of Corollary 5 follows from Corollaries 6 and 7, implying Corollary 5 as a special case. (See Appendix B.1.2 for the proof.)

Corollary 8 Let µ be any measure on Y such that µ(W) > 0 with probability 1. Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage and C : X Y is its limiting mapping, i.e., limn E[µ( b Cn(X) C(X))] = 0. Then

P(W Det C(X) C(X)) 1 α,

and in particular, if C is deterministic, then P(W C(X)) 1 α.

Predictive Inference with Weak Supervision

2.2 A general weak-conformal scheme via scoring functions

The theoretical limitations we identify motivate the weak coverage (3) we target instead of the strong converage (1). Following our discussion above, the new coverage deﬁnition stems from two desiderata: if the problem is actually low-noise and there already exists a highly predictive model we can leverage to build our conﬁdence sets roughly, that conditional on x, a single label y belongs to the weak sets W with high probability and a model exists that can predict this y then while we should return this singleton even if we cannot guarantee strong coverage. In the alternative perspective that we care only about the value ϕ(Y ) recall the weak set (2) providing any y satisfying ϕ(y) = Y weak should suﬃce for prediction. We turn now to provide our general weak conformalization scheme. Our starting point is via the typical output of a machine-learned model, a scoring function s : X Y R that ranks potential labels (or responses) y for an input example x X. We treat s(x, y) is a non-conformity score, meaning the model predicts that values of y for which s(x, y) is small are more likely. Standard examples of such scoring functions include s(x, y) := |y bµ(x)| in regression, where bµ : X R predicts y | x; or s(x, y) := log py(x) in multiclass classiﬁcation, where py(x) models the conditional probability of y | x. Throughout this section, we adopt a split-conformal perspective (Vovk et al., 2005; Barber et al., 2021), assuming the practitioner provides a scoring function independent of the sample (Xi, ϕi, Y weak i )n i=1 (the sample would typically be a validation set), and we show how to transform any such scoring function into a valid weakly-covering conﬁdence mapping.

Algorithm 1 Partially supervised conformalization

Input: sample {(Xi, Y weak i , ϕi)}n i=1; score function s : X Y R independent of the sample; desired coverage 1 α (0, 1) For each i [n], compute

Si := min y: ϕi(y)=Y weak i s(Xi, y). (5)

Set btn := (1 + n 1)(1 α)-quantile of {Si}n i=1. Return: predictive set mapping b Cn : X Y deﬁned by

b Cn(x) := y Y | s(x, y) btn .

Algorithm 1 starts from a simple observation, assuming that the scoring function s is accurate, so that likely y achieve small scores s(x, y). Given a query function ϕ and weak label Y weak equivalently, the weak set W = {y | ϕ(y) = Y weak} the most likely label should typically be the y minimizing s(x, y) over all y satisfying ϕ(y) = Y weak. An equivalent scheme to the scores (5) with label mappings ϕ and weak labels Y weak i uses weak sets Wi, where we replace the scores (5) with

Si := min y Wi s(Xi, y).

Cauchois, Gupta, Ali, and Duchi

As a concrete example, for R-valued predictions with s(x, y) = |y bµ(x)| and interval weak sets Wi = [li, ui], we have Si = [li bµ(Xi)]+ + [bµ(Xi) ui]+. In either case, Algorithm 1 achieves valid weak coverage (3):

Theorem 9 Let (Xi, Yi, ϕi)n+1 i=1 iid P and Y weak i = ϕi(Yi) for i [n + 1]. Then Algorithm 1 returns a conﬁdence set mapping satisfying

P h There exists y b Cn(Xn+1) s.t ϕn+1(y) = ϕn+1(Yn+1) = Y weak n+1 i 1 α.

Proof Let Si := minϕi(y)=Y weak i s(Xi, y) for each i [n + 1]. By deﬁnition of b Cn, we have

n y b Cn(Xn+1) | ϕn+1(y) = Y weak n+1 o = n y Y | ϕn+1(y) = Y weak n+1 and s(Xn+1, y) btn o ,

which is nonempty if and only if

Sn+1 := min ϕn+1(y)=Y weak n+1 s(Xn+1, y) btn.

As {Si}n+1 i=1 are i.i.d., this occurs with probability at least 1 α (e.g. Tibshirani et al., 2019, Lemma 1).

3 Constructing eﬀective conformal prediction sets

Algorithm 1 provides a generic method for conformalization in the presence of partially supervised data, and it makes no assumptions on the input score function s. Though the coverage guarantee (3) holds regardless, we can delineate a few additional desiderata that the predictive sets and score functions s should satisfy to make them more practically useful, which is our focus in this section:

The score function s must allow the practitioner to eﬃciently carry out the computation of the partial inﬁmum scores (5).

The lower level sets b Cn(x) = {y Y | s(x, y) btn} should be eﬃciently representable.

The conﬁdence sets b Cn(x) should be small, as smaller conﬁdence sets (for a ﬁxed conﬁdence level α) carry more information.

Deferring our discussion of computational eﬃciency to Section 4, in this section we only focus on the last desideratum, implicitly assuming computation is tractable (for example, that Y is small). We ﬁrst (Sec. 3.1) develop conditions suﬃcient for optimally-sized conﬁdence sets to even exist a few subtleties arise before giving greedy algorithms for conﬁdence set-size minimization, describing their properties, providing a few optimality guarantees in Section 3.2, and connecting to submodular minimization in Appendix A.

Predictive Inference with Weak Supervision

3.1 Size-optimal scoring mechanism

As in standard approaches to conformal inference (Lei, 2014; Lei and Wasserman, 2014; Barber et al., 2021), we aim to construct a conﬁdence set mapping b Cn with minimal average size over X PX. Our starting point is simply to deﬁne size-optimality, where to achieve exact coverage and size guarantees, we allow randomization of our conﬁdence sets via an independent variable U Uni[0, 1].

Deﬁnition 10 A randomized conﬁdence set mapping C1 α : X [0, 1] Y is marginally size-optimal at level α if it solves

minimize C:X [0,1] Y EX,U Uni[0,1] [|C(X, U)|]

subject to P(W C(X, U) = ) 1 α. (Marg)

It is conditionally size-optimal at level α if for almost every x X, C(x, ) solves

minimize C:[0,1] Y

EU Uni[0,1] [|C(U)|] s.t. P(W C(U) = | X = x) 1 α . (Cond)

Even with full knowledge of the distribution P, techniques for ﬁnding marginally sizeoptimal conﬁdence sets (Marg) are not immediately apparent; as a consequence, we focus on the conditional case ﬁrst. Even in this case, it is in general non-trivial to obtain smallest conﬁdence sets. Yet as we follow the standard practice in conformal prediction of deﬁning conﬁdence sets via the scores s (recall Alg. 1) as Ct(x) = {y | s(x, y) t}, our conﬁdence sets have the natural nesting property that Ct(x) Ct (x) whenever t < t . Abstracting away the particular form of C to enable a purely set-based focus, we thus consider nested conﬁdence sets, where we show that optimality guarantees are possible.

Deﬁnition 11 A collection of mappings {Cη : X [0, 1] Y}η (0,1) is nested if

P(Cη1(X, U) Cη2(X, U)) = 1 for all 0 < η1 < η2 < 1.

There is an immediate equivalence between score-based conformalization schemes and nested collections of conﬁdence mappings (Gupta et al., 2022): we simply deﬁne

snest(x, y, u) := inf {η (0, 1) | y Cη(x, u)} . (6)

The next lemma formalizes this equivalence (see Appendix B.2.1 for a proof).

Lemma 12 Assume the conﬁdence set mappings {Cη}η (0,1) are nested and snest(x, y, U) has continuous distribution for U Uni[0, 1]. Then

Cη(x, U) = y Y | snest(x, y, U) η with U-probability 1.

That is, obtaining weak coverage for nested conﬁdence mappings is equivalent to obtaining weak coverage using the scoring function snest, which Alg. 1 provides; that is, it is equivalent to choosing the smallest η (0, 1) such that P(W Cη(X, U) = ) 1 α. A second useful distributional property of the nested scores (6) is that, assuming the conﬁdence sets Cη are conditionally valid, we can provide strong distributional results on snest. To make this

Cauchois, Gupta, Ali, and Duchi

precise, we say that Cη is conditionally valid for the weak labels W if for each η (0, 1) and with P-probability 1 over X,

P (Cη(x, U) W = | X = x) = η. (7)

We then have the following uniformity property as an immediate consequence of Lemma 12:

Lemma 13 In addition to the conditions of Lemma 12, assume that Cη is conditionally valid (7) for the weak label W. Then the minimum score (5) is independent of X and satisﬁes

inf y W snest(x, y, U) Uni[0, 1].

Proof With U-probability 1, infy W snest(x, y, U) η if and only if Cη(x, U) W = , and so P(infy W snest(x, y, U) η | X = x) = P(W Cη(x, U) = | X = x) = η.

Einbinder et al. (2022, Prop. 1) gives a similar result to Lemma 13, where the conformity scores they introduce also induce prediction sets satisfying the nested property. To illustrate this lemma, suppose that there exist nested conditionally size-optimal mappings {Ccond η }η (0,1) solving problem (Cond): in that case, they satisfy the conditions for application of Lemma 13 so that the induced scores Si are uniform; Alg. 1 will thus compute btn = (1 α)+OP (n 1/2) as btn is the (1 α) quantile of Si iid Uni[0, 1]. So in the case that we have (near) conditional coverage Alg. 1 maintains it. Notably, given a score function s, not necessarily the nested score (6), but strong in the sense that it models (X, Y ) well enough that for each α, we can choose t so that P(s(x, Y ) t | X = x) = α, then the conﬁdence sets Algorithm 1 returns are indeed nested, and Lemma 13 applies to the induced nested score snest. Optimal nested sets need not always exist (see Example 3 below), but we can provide natural conditions on the distribution of W | X = x suﬃcient to allow such nested coverage, which we do in the next subsection.

3.1.1 From conditionally to marginally valid confidence sets

Our initial criterion (Marg) is purely marginal: we wish to compute a marginally sizeoptimal conﬁdence set. Conveniently, conditionally size-optimal mappings can yield marginally size-optimal problems. In particular, assume that the mappings {Ccond η }η (0,1) are conditionally size-optimal (Cond) and satisfy P(W Ccond(x, U) = | X = x) η. The following proposition shows how to transform these into marginally size-optimal conﬁdence sets.

Proposition 14 Let the mappings {Ccond η } be conditionally size-optimal (Cond) as above, and deﬁne the average size(x, η) := EU[|Ccond η (x, U)|]. Let smarg be any minimizer of

E [size(X, s(X))] s.t. E[s(X)] 1 α

over s : X [0, 1]. Then a solution to the initial marginal problem (Marg) is

Cmarg 1 α (x, u) := Ccond smarg(x)(x, u).

Predictive Inference with Weak Supervision

More directly, any conditionally size-optimal sets which are at least easier to characterize as they need only randomize over U Uni[0, 1] yield marginally size-optimal conﬁdence sets in a relatively straightforward way: one chooses the probability of miscoverage, s(x), minimizing the expected conﬁdence set size. Proof That CMarg 1 α provides valid 1 α coverage is nearly immediate: by conditional size optimality, we have P(W Cmarg 1 α (X, U) = ) = E[tmarg(X)] 1 α. Let C be any conﬁdence set mapping such that P(W C(X, U) = ) 1 α, and deﬁne s C(x) := P(W C(x, U) = | X = x) [0, 1], which satisﬁes E[s C(X)] 1 α. By assumption on Ccond, for each ﬁxed x X, the set Ccond s C(x)(x, U) is size-optimal (Cond) at level s C(x), so that for PX-almost every x X, we have

size(x, s C(x)) = EU Uni[0,1] h |Ccond s C(x)(x, U))| i EU Uni[0,1] [|C(x, U)|] .

Integrating both sides of the inequality over X PX, and using the assumed optimality condition on smarg, we obtain

E [size(X, smarg(X))] E [size(X, s C(X))] EX,U [|C(X, U))|] .

The left-hand size is the average size of Cmargin 1 α .

3.2 Greedy algorithms for conﬁdence set-size minimization

Given the distribution or a model of the distribution of the weak set W conditional on x, we propose a natural greedy algorithm to construct a conﬁdence set satisfying the weak coverage constraint: at each step, Algorithm 2 adds the label that increases coverage the most until the conﬁdence set achieves a desired level. Algorithm 2 draws inspiration from Romano et al. s Algorithm 1 2020, where the authors formulated conformal inference methods tailored for categorical and unordered response labels. These methods not only ensure valid marginal coverage but also aﬀord approximate conditional coverage. As we show presently, there are natural families of distributions where this greedy algorithm is optimal; however, there are failure modes, of which we also provide an example. In Appendix A, we relate this greedy construction to submodular optimization to provide general guarantees of conﬁdence set size and coverage. Alg. 2 returns a nested sequence {Cgr η (x, U)}η (0,1), where U Uni[0, 1] randomizes to achieve an appropriate level. While the sequence need not necessarily solve problem (Cond) (see Example 3 to come), there are natural suﬃcient conditions for Algorithm 2 to return a size-optimal set, of which we present two. As the ﬁrst particular case, consider that conditional on x, labels y Y belong to W independently:

Deﬁnition 15 A probability distribution P on W 2Y has label-independent structure if {1{y W}}y Y are independent random variables when W P.

We might expect W to exhibit label independence when all labels y Y satisfy π(y | x) 1, with the exception of a single label y (x), for which π(y (x) | x) 1, as will often be the case in low-noise classiﬁcation settings.

Cauchois, Gupta, Ali, and Duchi

Algorithm 2 Greedy weakly supervised scoring mechanism

Input: model for the distribution of W given X = x; coverage rate η (0, 1) for each j [K] deﬁne recursively

yj(x) := argmax y Y P

i=1 {yi(x) W} | X = x

for each j [K] deﬁne Cgr,j(x) := {yi(x) | i j} and set

j(x, η) := min j [K] | P W Cgr,j(x) = | X = x η .

tη(x) := η P(Cgr,j(η,x) 1(x, U) W = | X = x) P(Cgr,j(x,η)(x, U) W = | X = x) P(Cgr,j(x,η) 1(x, U) W = | X = x).

return function Cgr η : X [0, 1] Y deﬁned by

Cgr η (x, u) :=

( Cgr,j(x,η)(x) if u < tη(x), Cgr,j(x,η) 1(x) otherwise.

Figure 1. A tree-structured (8) distribution for W given X = x, with Y = {1, 2, 3, 4}. The possible conﬁgurations for W are the singletons {y1}, {y2}, {y3}, {y4}, the two pairs W1 = {y1, y3} and W2 = {y2, y5}, W0 = {y1, y2, y3, y5}, and Y itself.

Another scenario occurs when the label space exhibits a hierarchical tree structure, as one may expect in image classiﬁcation (Deng et al., 2009) or structured prediction tasks (Cabannes et al., 2020). When the weak sets W obey the same structure as the distribution they are subtrees of the global tree we say the labels have a tree structure (see Figure 1):

Deﬁnition 16 A probability distribution P on W 2Y has a tree structure if for all w1, w2 Y,

P(W = w1) > 0 and P(W = w2) > 0 imply w1 w2 {w1, w2, }. (8)

Both deﬁnitions (independent labels and hierarchically-structured weak labels) are sufﬁcient to guarantee size-optimality for the greedy conﬁdence sets Algorithm 2 constructs. The next Proposition, whose proof we provide in Appendix B.2.2, makes this formal.

Predictive Inference with Weak Supervision

Proposition 17 Suppose the probability law L(W | X = x) has either label-independent structure (Def. 15) or a tree structure (Def. 16). Then for all η (0, 1), Cgr η is conditionally size-optimal, and therefore is a minimizer in equation (Cond).

In general, even with perfect knowledge of the distribution of W | X = x, the nested greedy conﬁdence sets Cgr η need not be size-optimal, as there may be weak sets appearing with high probability while their constituents do not, so that the conditionally size-optimal sets {Ccond η }η [0,1] are not nested. The next example illustrates one such failure mode: Example 3: Let the distribution of W be

{1, 2} w.p. 0.3 {1, 3} w.p. 0.25 {2} w.p. 0.2 {3} w.p. 0.15 {1} w.p. 0.1.

Then for η = 0.9, it is immediate that Ccond η (x, u) = {2, 3}, but Cgr η (x, u) = {1, 2, 3} or {1, 2} depending on whether u < 1/3. In addition, Ccond eta (x, u) = {1, 3} when η = 0.85, showing that in this case, the conﬁdence set mappings Ccond η need not be nested. 3

In Appendix A, we include a few ancillary results showing that even in general cases like Example 3, the sizes of the conﬁdence sets Cgr and Ccond cannot be too far apart.

4 Eﬃcient conformalization for large output spaces

While Section 3 provides a generic treatment on for producing scoring functions and associated conﬁdence sets of minimal size, in typical practice, a (pre-trained) model provides a predictive scoring function, which may not be directly associated to a probability metric, and we wish to leverage such models. This is of particular interest when the label space Y is large, as in structured prediction problems (Taskar, 2005; Cabannes et al., 2020), where computational eﬃciency becomes a main challenge. In this section, we thus ﬁrst introduce a general method for computing and representing conﬁdence set mappings of the form {y | s(x, y) ˆtn}, and then describe how to eﬃciently carry out Alg. 1 in ranking problems (Section 4.2); for interested readers, we include constructions for matching problems in Appendix C.1 problems.

4.1 Conformal conﬁdence sets with sequential partitioning

We seek to eﬃciently compute and represent the conﬁdence set b Cn(x) for any instance x X, typically for a task where the label space contains more conﬁgurations than are eﬃciently enumerable (K! for matching and ranking problems over K items). At the same time, recalling that btn denotes the threshold Algorithm 1, if our conﬁdence sets are to be informative they should include relatively few conﬁgurations y Y satisfying s(x, y) btn. To the end of computing the set b Cn(x) = {y | s(x, y) btn} in Alg. 1, we focus on methods for computing a given number M of conﬁgurations with the smallest score s(x, y). This is essentially without loss of generality: while we may not know the appropriate M = Mx =

Cauchois, Gupta, Ali, and Duchi

Partition Update

y3 1,2 y3 3,2

y3 1,2 y3 3,2 y3

y3 2,2 = y4

y4 1,2 y4 3,2

Figure 2. Alg. 3 scheme for sequential partitioning: ﬁrst, partition the subset containing the m + 1-th best conﬁguration, y3 2,2 in this case, then compute both second-best conﬁgurations in the newly formed subsets of the partition here Y4 2 and Y4 4.

| b Cn(x)| to guarantee coverage, if for each M N we can ﬁnd the M best conﬁgurations in time polynomial in M, then by sequentially doubling M until we obtain an element y Y such that s(x, y) > btn, we achieve time polynomial in Mx. Algorithm 3 builds on this intuition to return a valid conﬁdence set. The Algorithm we suggest is essentially an extension of the algorithm proposed by Chegireddy and Hamacher (1987) to ﬁnd the Kbest matchings in a bipartite graph. We reuse the general idea (i.e. compute a sequence of partitions of the space and maintain a list of the two best conﬁgurations for each item of the partition) and extend it to a more general structured prediction. The key is to observe that we only need to be able to compute the two best conﬁgurations of a given subset of conﬁgurations, which they do on matching problems (and our Algorithm essentially reduces to theirs in the matching case). We then apply that paradigm to the ranking case. We remark brieﬂy that an alternative approach is to conformalize directly on the size M of the conﬁdence set: suppose we learn a function c M : X N predictive of the rank (according to {s(x, y)}y Y) of the ﬁrst compatible conﬁguration, i.e predictive of

Mi := rank of the ﬁrst conﬁguration y Y such that ϕi(y) = Y weak i .

In that case, if we let b Qn := 1 + n 1 (1 α) -quantile of {Mi c M(Xi)}n i=1, we would only need to return

b Cn(x) := n c M(x) + b Qn best conﬁgurations y Y ordered by s(x, y) o .

This approach makes prediction more eﬃcient (as we know in advance the number of conﬁgurations to compute), but the computational eﬀort of the conformalization step (5) increases, as we must compute the rank of the best constrained conﬁguration for each instance.

4.1.1 Returning M best configurations with sequential partitioning

Let us now ﬁx M 1, and focus on retrieving the M conﬁgurations with the lowest scores. Algorithm 3 provides a general recipe using dynamic programming, and it is eﬃcient as long as we can eﬃciently compute certain partitions of the label space. We require the following deﬁnition.

Predictive Inference with Weak Supervision

Deﬁnition 18 A function Partition : 2Y Y Y 2Y 2Y is valid for a score function s if, for every subset Y Y and pair of conﬁgurations y1, y2 Y satisfying

y1 argmin y Y s(x, y) and y2 argmin y Y\{y1} s(x, y),

Partition( Y, y1, y2) returns a partition ( Y1, Y2) of Y such that y1 Y1 and y2 Y2.

We thus leverage two conditions: a valid Partition for our score function s and, for each pair of subsets Y1, Y2 Y that it produces, we must be able to (eﬃciently) compute the second-best conﬁgurations in Y1 and Y2, i.e.,

y1,2 argmin y Y1\{y1} s(x, y) and y2,2 argmin y Y2\{y2} s(x, y).

Figure 2 encapsulates the main idea Alg. 3: at each step m [M], we maintain a partition {Ym j }m j=1 of Y such that if ym j argminy Ym j s(x, y), then for all j [m], we have

ym j argmin y Y\{ym 1 ,...,ym j 1} s(x, y),

i.e., ym j is the j-th best conﬁguration in Y. Now, for each j [m], let the conﬁguration ym j,2 argminy Ym j \{ym j } s(x, y) be the second-best conﬁguration in Ym j . The key is then to observe that if we set

ind(m) := argmin j [m] s(x, ym j,2),

then ym ind(m),2 is the (m+1)st best conﬁguration in Y. The Partition function then divides

Ym ind(m) into two sets Ym+1 ind(m) and Ym+1 m+1 such that ym ind(m) Ym+1 ind(m) and ym ind(m),2 Ym+1 m+1. Under the assumption that Partition is valid (Def. 18) for the score s, the following lemma guarantees the validity of Algorithm 3.

Lemma 19 Assume the Partition function is valid for the score function s. Then Algorithm 3 returns a set of conﬁgurations {yj}M j=1 such that for each j [M],

yj argmin y Y\{y1,...,yj 1} s(x, y).

Proof This follows by an induction over m 1, which guarantees that at every step m 1, {Ym j } is a partition of Y such that ym j = argminy Ym j and

s(x, ym 1 ) s(x, ym 2 ) s(x, ym m) min y Y\{ym j } s(x, y).

The property transitions from m to m+1 as the Partition function is valid, and we choose ym+1 m+1 as the best second-best conﬁguration, hence it is the (m + 1)st best conﬁguration.

The existence of an eﬃcient valid Partition function is instance-dependent and typically requires a speciﬁc choice of scoring function; we provide concrete implementations for two types of structured prediction problems.

Cauchois, Gupta, Ali, and Duchi

Algorithm 3 Sequential partitioning Require: score function s : X Y R; valid (Def. 18) Partition: 2Y Y Y 2Y 2Y; instance x X initialize: Compute y1 1 := argminy Y s(x, y) and y1 1,2 := argminy Y\{y1 1} s(x, y). Set Ym 1 := Y {Initialize the partition} for m = 1, 2, . . . , M 1 do

ind(m) := argminj [m] s(x, ym j,2) {Find the m + 1-th best conﬁguration} ym+1 m+1 := ym ind(m),2 for j [m] \ {ind(m)} do

(Ym+1 j , ym+1 j , ym+1 j,2 ) := (Ym j , ym j , ym j,2) {All subsets {Ym j }j =ind(m) remain identical} end for Ym+1 ind(m), Ym+1 m+1 := Partition(Ym ind(m), ym ind(m), ym+1 m+1) {Partition the set Ym ind(m)}

ym+1 ind(m) := ym ind(m) and ym+1 ind(m),2 := argminy Ym+1 ind(m)\{ym+1 ind(m)} s(x, y)

ym+1 m+1,2 := argminy Ym+1 m+1\{ym+1 m+1} s(x, y) {Compute second-best conﬁgurations} end for return {y M m }M m=1

4.2 Structured prediction examples (Ranking problems and partial labeling mechanisms)

While Algorithm 3 is generic, we now show that eﬃcient partitioning and minimization functions exist in structured prediction instances, so that we may eﬃciently carry out above algorithms in the instance. Here we focus on ranking problems and defer the discussion on matching tasks in Appendix C.1. The goal here is to predict a preference ranking y Y = SK of K diﬀerent items, documents, for a certain user or query x X, where y(i) denotes the item of rank i. Typically, one achieves this by learning relevance functions rk : X R, which evaluate each item 1 k K individually before aggregating into a single ranking prediction (Freund et al., 2003; Duchi et al., 2013; Qin and Liu, 2013; Cao et al., 2007). We assume here that we have access to such relevance functions. In ranking tasks, there are two reasonable ways in which practitioners may acquire partial supervision or user feedback. The ﬁrst mechanism (Cabannes et al., 2020) assumes they only receive a subset of all K 2 pairwise comparisons 1 y 1(i) < y 1(j)

1 i<j K as a partial label, which is especially relevant in cases where the practitioner solicits feedback from users by asking them to compare a small number of items. Unfortunately, carrying out the computation (5) in Alg. 1 reduces to the NP-hard minimum cost feedback arc set problem (Ailon et al., 2008; van Zuylen et al., 2007), for which only an approximate solution is available (by solving an integer linear program). Another form of feedback, on which we focus in the rest of the section and that allows running both Algs. 1 and 3 eﬃciently, instead assumes that users only provide a fraction of their preferred ranking and reveal (y(i))Kpartial i=1 for some Kpartial K (top-Kpartial feedback in Example 1). To construct score functions amenable to the application of Alg. 3, we ﬁrst introduce ranking-consistent score functions.

Predictive Inference with Weak Supervision

Deﬁnition 20 A scoring function srank is ranking-consistent with a set of relevance functions {rk : X R}k [K] if a for all 1 i < j K and y SK,

srank(x, (i, j) y)) srank(x, y) if ry(i)(x) ry(j)(x), (9)

where (i, j) y denotes transposition of i and j in the permutation y.

Such a scoring mechanism should always favor a ranking that gives a higher rank to y(j) than y(i) if ry(i)(x) ry(j)(x), i.e., if y(j) has a greater relevance than y(i). It ensures in particular that the (m + 1)st best ranking is always a neighbor of one of m best; this is an immediate property of the score function, as it can always increase by swapping two elements i and j that are mis-ordered. An example of ranking-consistent scoring function is the disagreement-based scoring function (Kendall, 1938; Kemeny, 1959; Duchi et al., 2013)

srank(x, y) := X

i<j ψ ry(i)(x), ry(j)(x) , (10)

where ψ : R2 R 0 is a function satisfying ψ(a, b) = 0 when a b and ψ(a, b) > 0 when a < b, non-increasing in the ﬁrst argument and non-decreasing in the second. Unless we specify otherwise we use ψ(a, b) = [b a]+ in our experiments. Finding the conﬁguration y that minimizes the partial score (5) of a ranking-consistent score function is straightforward: it suﬃces to rank all the elements in [K] \ {y(i)}Kpartial i=1 according to their relevance scores (rj(x))K j=1, and then append them to the ﬁrst Kpartial

elements. This property allows eﬃciently retrieving the M 1 best conﬁgurations with Alg. 3. Throughout the loop, we make sure that any set of permutations Ym j is a subset of permutations consistent with a ﬁnite number of partial rankings (pairwise comparisons), and that its best and second-best conﬁgurations ym j and ym j,2 only diﬀer by a neighboring transposition of the form (i + 1, i), satisfying

ym j,2 := argmin y Ym j {s(x, y) | i [K], y = (i + 1, i) ym j }. (11)

If we can guarantee this loop invariant, then there always exists ij,m [K] such that ym j,2 = (ij,m + 1, ij,m) ym j , and we only need to deﬁne the partition function on a smaller subset of 2Y Y Y: for any subset of permutations Y Y, y Y and i [K] such that (i + 1, i) y Y, we let

Partition Ranking( Y, y, (i + 1, i) y) := Y y Y | y 1( y(i)) < y 1( y(i + 1)) , Y y Y | y 1( y(i)) < y 1( y(i + 1)) , (12)

splitting Y according to whether y(i) has a higher rank than y(i + 1). The next lemma, whose proof is in Appendix B.3.1, states that this partition rule indeed guarantees that, at every step m of the loop in Algorithm 3, the second-best conﬁguration in Ym j satisﬁes the invariant (11).

Lemma 21 Assume the score function is ranking-consistent (9) for a set of relevance functions {rk}K k=1. Then Algorithm 3 with the Partition Ranking function (12) produces a sequence of partitions with second-best conﬁgurations satisfying equation (11).

That is, Algorithm 3 is correct.

Cauchois, Gupta, Ali, and Duchi

5 Experiments

In this section, we test our weakly supervised methods experimentally, in diﬀerent classiﬁcation and regression problems, on both synthetic and real datasets, with an emphasis on their computational eﬃciency and informativeness. Here we focus on ranking problems and present further experiments on matching and regression problems in Appendix C. The primary goal of this paper is not to provide end-to-end models with only partially supervised data, but rather to introduce a new form of coverage validity and show how to achieve it with partially labeled data. In contrast to the split-conformal method (Vovk et al., 2005), which requires fully supervised instances for both training and validating, we only need these to train a model and form a scoring function suitable for the application of Alg. 1. In some cases, standard models already exist, such as in image classiﬁcation (He et al., 2016). To provide a meaningful comparison with existing conformal methods and test for predictive set size eﬃciency, we use fully labeled real datasets, and introduce diﬀerent plausible forms of weak supervision on our calibration and test sets before applying Algorithm 1 to construct conﬁdence sets. Our method displays similar behavior across all datasets and forms of partial information. To provide a baseline, we also run a standard fully supervised conformal scheme (FSC) using the strong labels Yi and true scores s(Xi, Yi), which runs similarly as Alg. 1, but with threshold

btfull n := (1 + n 1)(1 α)-quantile of {s(Xi, Yi)}n i=1. (13)

We can then estimate the gain in eﬃciency in the form of decreased conﬁdence set sizes that stems from the weakening of strong coverage (1) to weak coverage (3).

5.1 A toy classiﬁcation example

We ﬁrst perform an experiment with a toy multiclass data set containing K = 10 diﬀerent classes and d = 2 dimensional features. We consider a partially supervised problem on X Y = Rd [K] for which we wish to output valid conﬁdence sets. We use the following model: each potential response y [K] has a noisy score depending on the feature vector X Rd though a vector θ y Rd,

{Soracle y }y [K] | X = x N({x T θ y}y [K], σ2IK) (14)

Ideally, we would recover the strong label Y := argminy Y Soracle y , but our weakly supervised methods do not observe Y directly: instead, for a random instance-dependent threshold T, we only have access to the weak set

W := {y Y | Soracle y T}.

As motivation, consider a supervised learning task in which, out of all potential responses, there is always only one ground truth, but there are other labels that are good enough (i.e. have a low enough score) to answer a certain query. In this setting, a conﬁdence set is weakly valid (3) as long as it contains at least one label y such that Soracle y T, whereas it is strongly valid (1) if it contains Y . We vary the signal-to-noise ratio σ 1 {10 2, . . . , 102}: when it is too small, no model (even an oracle one) can be highly predictive, and a standard conformal method should

Predictive Inference with Weak Supervision

Strong Weak

0.03 0.1 0.3 1 3 10 30 100 0.03 0.1 0.3 1 3 10 30 100

factor(SNR)

0.1 1.0 10.0 100.0 SNR

Average Set Size

σ 1 σ 1 σ 1

E h ˆC(X) i

P(Y b C(X)) P(W b C(X) = )

Figure 3. Results for the simulated multiclass data (14), over Ntrials = 20 trials. The left plot shows respectively the strong (1) and weak (3) coverage for the greedy weakly supervised (GWS), the weakly supervised conformal (WSC) and the fully supervised conformal (FSC) conﬁdence sets. The right plot displays the average conﬁdence set size for these methods.

provide large uninformative conﬁdence sets, whereas we expect our new deﬁnition of coverage to yield smaller sets, as any label in W (i.e. with a low enough score) provides valid coverage.

In this experiment, we compare three diﬀerent methods. The Greedy weakly supervised (GWS) method only uses partially labeled data both when training and conformalizating. It ﬁrst trains K separate logistic regressions with {Xi} as features and each {1{y Wi}} for all y Y as potential response, providing a model for P(y W | X = x), and models the distribution of W given X = x as label-independent (see Deﬁnition 15). It then computes a nested sequence of conﬁdence sets thanks to Alg. 2, which we then feed to the conformalization Algorithm 1 using the nested scoring mechanism (6).

The second and third methods, the Weakly supervised conformal (WSC) and the Full supervised conformal (FSC) methods respectively, use fully supervised data for training: we ﬁrst train a standard logistic regression model pθ(y | x) exp(θT y x) on {(Xi, Yi)}, and then construct a scoring function using the Generalized Inverse Quantile (GIQ) procedure that Romano et al. (2020) introduce. In the conformalization step, the WSC method runs Algorithm 2 with partially labeled calibration data, while the FSC method uses strongly labeled data to compute the threshold btfull n in (13). The threshold btn in Alg. 2 is always smaller than btfull n , so the FSC method returns larger conﬁdence sets than the WSC method. We expect that as the signal to noise ratio decreases, the gap between the GWS and WSC conﬁdence sets and the FSC conﬁdence sets increases.

Cauchois, Gupta, Ali, and Duchi

The precise experimental set-up is as follows: we simulate n = 104 data points, splitting them into training (30%), calibration (20%) and test (50%) sets. We draw each θy uniform

on Sd 1, {Xi}n i=1 iid N(0, Id), choosing weak threshold T Uni[miny Y{SOracle y }, maxy Y{SOracle y }]. We repeat the entire process Ntrials = 20 times to account for uncertainty, presenting our results in Figure 3. As we expect, using an alternative weaker version of coverage (3) allows us to signiﬁcantly decrease the size of the conﬁdence set (by up to a factor of 3), especially when the signal-to-noise ratio is small, as one must include more classes in the conﬁdence set to maintain strong coverage. Indeed, we can see that the strong coverage (1) for the GWS and WSC procedures fall well below 1 α = 95% in this case, since they only strive for weak 1 α coverage, which they consistently achieve. Since the GWS method aims to construct minimal conﬁdence sets, we expect that it produces smaller conﬁdence sets than the WSC method, which simply leverages an existing strongly supervised model; we consistently observe this across diﬀerent values of σ.

5.2 Document ordering for query answering

We now present the results of two experiments using Alg. 1 in a ranking problem. The ﬁrst simulates a standard ranking task, while the second focuses on ranking documents relevance to speciﬁc queries in the Microsoft LETOR dataset (Qin and Liu, 2013).

5.2.1 Ranking simulation study

In a ﬁrst simulation study, we aim to predict a ranking of labels y [K] based on a feature vector X Rd. Think here of a supervised problem where we want to rank users preferences for a set of items. Each user has an unknown relevance score SOracle y R for each item y [K], which induces a ground truth ranking over the labels:

Y := argsort{SOracle y }y [K] SK.

The problem is to recover this noisy ranking and produce valid conﬁdence sets in SK, but our weakly supervised methods do not observe the full ranking when conformalizing: they can only observe the ranking up to the Kpartial K-th element, leading to the weak set

W = {y SK | j [Kpartial], y(j) = Y (j)}. (15)

In our experiment, we simulate n = 104 i.i.d. diﬀerent users, using the same (30,20,50) train/validation/test split as in Section 5.1. With K = 7 and d = 2, we draw the user

feature vector Xi iid N(0, Id), and then conditionally on Xi, we produce normal itemwise relevance scores {SOracle iy }i [n],y [K] following the distribution (14). We ﬁnally simulate

partial supervision by drawing the number of observed elements in the ranking Kpartial i as

min(K, 1 + Ai), where Ai iid Poi(.5). The lower left panel of Figure 4 shows the overall distribution of this quantity: most users only reveal the ﬁrst 1 to 3 items in their optimal ranking. We then produce strongly and weakly valid conﬁdence sets at the 1 α := 90% level. We use the same scoring model for both the fully supervised conformal (FSC) and weakly

Predictive Inference with Weak Supervision

Strong Weak

0.03 0.1 0.3 1 3 10 30 100 0.03 0.1 0.3 1 3 10 30 100 0.4

factor(SNR)

1 2 3 4 5 6 7 size

1e 01 1e+00 1e+01 1e+02 1e+03 Size

CDF P(|C(X)| <= Size)

P(Kpartial i = t)

Size t Size t

P | b C(X)| t)

P(Y b C(X)) P(W b C(X) = ) A

Figure 4. Results for the ranking simulation study (15) over Ntrials = 20 trials. A: Strong (1) and Weak (3) coverage for the weakly supervised conformal (WSC) and the fully supervised conformal (FSC) conﬁdence sets. B: Density histogram of the variable Kpartial (15) governing the weak distribution in this example. C: Distribution of the conﬁdence set size | b C(X)| for diﬀerent signal-to-noise ratios σ 1.

Cauchois, Gupta, Ali, and Duchi

supervised conformal (WSC) procedures: we learn linear individual relevance score functions {ry}y [K] (with fully supervised training data) via the List Net procedure (Cao et al., 2007), which we brieﬂy describe here. Given a set of relevance scores {ry}y Y RK, List Net models the probability of a ranking π SK as

exp(rπ(y)) Pk l=y exp(rπ(l)) , (16)

which gives each item y Y a top-1 probability (of ranking ﬁrst) equal to

P 1 r (y) := Pr(π(1) = y) = exp(ry) Pk l=1 exp(rl) .

Given a training data set containing pairs (X, R) X RK of features/relevance scores, we learn score mappings by minimizing the log-loss of the top-1 distribution over a set F of functions

{ˆry}y [K] := argmin r FY

(X,R) training data

k=1 P 1 R(k) log P 1 r(X)(k)

In our experiment, we only observe the ranking (or even a fraction of), not the true peritem relevance scores, hence, following common practice (Cao et al., 2007), we use Ry = K the rank of the item = K Y 1(y) as a proxy for our observed item-wise relevance scores when training our model. In our experimental set up, each relevance score function ry : X R ideally estimates the true conditional mean of the oracle scores, x 7 x T θ y. Given these individual scores, we use the scoring mechanism (10) with ψ(x, y) := (y x)+ and conformalize using the strategy we describe in Section 4.2. The diﬀerence between the WSC and FSC methods is the conformalization step on calibration data: WSC runs Alg. 1 with partially supervised data to compute the score threshold btn while FSC uses strongly supervised data to return the more conservative threshold btfull n in (13). Our results ﬁt our initial expectations, in line with our ﬁrst experiments: the size of the conﬁdence set, as Figure 4C shows, beneﬁts from the weaker deﬁnition of coverage: for any value of the signal to noise ratio σ 1 > 0, the WSC method produces much smaller and more informative conﬁdence sets than the FSC method, as it only needs to include a ranking with the correct ﬁrst Kpartial elements to provide valid coverage. ,At the cost of the strong coverage falling below 1 α (see Fig. 4A), and with little information (only the ﬁrst Kpartial < K labels), the WSC method constructs predictive sets that are much smaller and yet still valid (in a weak sense).

5.2.2 Ranking experiment with Microsoft LETOR dataset

We now tackle a slightly diﬀerent type of ranking problem: we wish to rank a set of potential documents by order of relevance to a speciﬁc user query: documents more relevant to the query should occupy a higher position in the ﬁnal ranking. A search engine is a good example of such problem: a user makes a search query, and the task is to sort Web pages that best answer that query among a (potentially large) set of potential pages.

Predictive Inference with Weak Supervision

Strong Weak

2 4 5 8 10 15

0 2 5 8 0 2 5 8

factor(Score.Weight)

2 4 5 8 10 15

0 25 50 75 100

CDF P(|C(X)| <= Size)

P | b C(X)| M t)

P(Y b C(X)) P(W b C(X) = )

K = 2 K = 4 K = 6 K = 8 K = 10 K = 15

K = 2 K = 4 K = 6 K = 8 K = 10 K = 15

Figure 5. Results for LETOR ranking dataset (Qin and Liu, 2013) over Ntrials = 20 trials. Each plot represents a diﬀerent value of K {2, 4, 8, 10, 15}, the number of documents to rank, and we compare diﬀerent scoring functions by varying the value of c {0, 2, 5, 8} in equation (18). A: Strong (1) and Weak (3) coverage for the weakly supervised conformal (WSC) and the fully supervised conformal (FSC) conﬁdence sets. B: Distribution of the conﬁdence set size | b C(X)| for diﬀerent numbers K of suggested documents. We display here the distribution of min(| b C(X)|, M) for M = 100.

Cauchois, Gupta, Ali, and Duchi

Learning to rank with Microsoft LETOR dataset (Qin and Liu, 2013) To study that problem, we design an experiment with Microsoft LETOR data set. For each potential query/document pair (x, d), the dataset aggregates several quantities of interest to determine whether d is relevant to x into a d = 46-dimensional feature vector φ(x, d) Rd. For a query x with a set of potentially relevant documents D(x) := {dj}|D(x)| j=1 , our data set additionally contains a ranking Π(x) S|D(x)| that orders these documents according to

their relevance. Our goal is to retrieve that ranking using the feature vectors {φ(x, dj)}D(x) j=1 .

A semi-synthetic weakly supervision set-up We construct weakly supervised calibration and test data sets as follows. For each split (calibration/test), we ﬁrst sample n = 2000 queries from the entire set of queries in LETOR validation and test datasets. For every query Xi, we select K {2, 4, 6, 8, 10, 20} documents by ﬁrst sorting D(Xi) into K equally sized subsets by relevance, so subset ℓ [K] contains the documents with rank Π(x)y for every y {(ℓ 1)|D(Xi)|

K + 1, . . . , ℓ|D(Xi)|

K }, and then drawing one document from each box uniformly at random. This procedure ensures that there exists a signiﬁcant relevance gap between any two potential documents in the query, and that the number of documents to rank is suﬃcient to allow reasonably-sized conﬁdence sets. Π(Xi) additionally induces a sub-ranking Yi SK on these documents, which we treat as a strong label. Similarly to our approach in Section 5.2.1, we introduce partial labels by assuming that our weakly supervised method can only access the ﬁrst Kpartial i elements of Yi, where Kpartial i iid 1 + Poi(.5): this simulates the plausible setting where a user has given feedback on the most relevant documents to the query, but certainly not to all of them. We repeat the entire simulation procedure Ntrials = 20 times.

Building a ranking scoring function (10) We next describe how we use fully supervised training data to construct the scoring function that we feed Alg. 1 with. We ﬁrst learn a linear query/document relevance function

rθ(x, d) := θT φ(x, d) (17)

using the List Net procedure (16) on LETOR (fully supervised) train data xi, (di,j)D(xi) j=1 , yi SD(xi) ntrain

containing 55700 diﬀerent query/document pairs. We then use a speciﬁc implementation of the score function s Ranking as in Eqn (10): if we rank K documents {dk}k [K] for a query x, we rescale our relevance scores to the interval [0, 1],

{rk(x)}k [K] := rθ(x, dk) minj [K] rθ(x, dj) maxj [K] rθ(x, dj) minj [K] rθ(x, dj)

and then, for a choice of c {0, 2, 5, 8}, apply the scoring mechanism (10) with these relevance scores and pairwise comparison function

ψc(r1, r2) := exp( cr1) (r2 r1)+ . (18)

Predictive Inference with Weak Supervision

In this example, we guarantee weak coverage if the true ranking Yi on the ﬁrst Kpartial i elements coincides with either one of the predictive rankings. To keep the predictive set size small, we thus wish to ensure that it doesn t contain two rankings with the same ﬁrst Kpartial i elements (as they would be redundant): this is why we introduce the exponential term exp( cr1), which makes sure that when ranking all conﬁgurations by their score, highly ranked conﬁgurations have diﬀerent ﬁrst elements (rather than diﬀerent last elements). To estimate the distribution of | b C(X)|, we then compute the M = 100 best rankings for each query using Alg 3, and then replace the size of the conﬁdence set by min(M, | b C(X)|)), eﬀectively truncating it to M.

Experimental results We present our results in Figure 5. The conﬁdence sets display the behavior we expect: when the number K of items to classify is small, the fully supervised conformal (FSC) and weakly supervised conformal (WSC) methods are similar, since partial labels are often equal to strong labels. Since the overall number of conﬁgurations is small, both methods are also able to maintain fairly small conﬁdence sets. On the other hand, when K grows, the weak supervision method quickly departs from the full supervision one, and is able to produce conﬁdence sets that are much smaller: when K 8, the FSC method is unable to produce conﬁdence sets with fewer than 100 conﬁgurations, as the number of conﬁgurations is large, and the problem is inherently noisy, especially for comparing documents with a fairly small relevance. The WSC (partially) overcomes that diﬃculty with its restrained notion of coverage, and is able to maintain a majority of conﬁdence sets with size smaller than M, at least until K = 15. Of course, this method pays a price in terms of strong coverage, as for large K, the conﬁdence set almost never contains the actual ground true ranking. That said, it may not a real issue as we are more interested in detecting which documents are actually relevant, and hence should have a higher rank, rather than correctly ordering documents with very little relevance to the query at the bottom of the list. In addition, as we predicted, higher values of c in the pairwise comparison function (18) produce much smaller conﬁdence sets by favoring more diverse rankings at the top of the list.

6 Discussion

The new measures of coverage we develop here tailored to partially supervised data that may be easier to collect in many engineering and measurement-centric scientiﬁc scenarios help to bridge a gap between typical conformal predictive inference methods, which require expensive supervised data, and problems with partial supervision, whose typical focus is on prediction but not uncertainty quantiﬁcation. Our hope is for this paper to open several avenues for future work. First, Algorithm 1 does not currently quantify the amount of coverage it provides conditionally on the query function, which essentially means in an item ranking framework that we do not know ahead of time whether we guarantee the top 2 or top 10 elements of the ranking to be correct. This occurs ﬁrst because the query function is unknown ahead of time, and second because coverage (3) is marginal over the full randomness of the sample. Similarly to conformal inference extension works bridging the gap between marginal and conditional coverage, or between marginal and label-wise

Cauchois, Gupta, Ali, and Duchi

coverage, one potential goal is to adapt these methods and even out coverage conditionally on (plausible) query functions. Our approach acts as a wrapper around any black box machine learning model, providing valid coverage guarantees independent of model quality. However, poorly trained models impact the eﬃciency of prediction sets, which can be large when training data is scarce. Thus, eﬀorts to mitigate overﬁtting and train high-quality models are paramount to ensuring the eﬃciency of our method s prediction sets. Scalability, while generally manageable with our methods for large datasets, presents challenges primarily during the conformalization step. Recognizing the absence of a one-size-ﬁts-all solution, we have tailored a few scalable and, we hope, exemplar methods that capture diverse applications. The new deﬁnition (3) is intrinsically a 0-1 loss-based approach, in the sense that the conﬁdence set b Cn either covers the weakly supervised set or fails. A natural initial extension is thus similarly to what Bates et al. (2021) propose in the strongly supervised case, recognizing that many structured prediction problems (e.g., segmentation tasks or multilabel problems) beneﬁt from more subtle and granular loss functions. In the same vein, we present a few eﬃcient choices of scoring mechanisms for structured prediction, which highlight the practicality and potential application of our general methodology; it seems quite plausible that more sophisticated scoring models could yield substantial improvements. In our view, one of the more exciting potential applications of this work reposes on the (growing) centrality of partial and weakly labeled data in statistical learning (Ratner et al., 2017). Whether this be from partial reporting in surveys, or because collecting labeled data is quite expensive, a major challenge in modern machine learning deployments and the release of statistical models is monitoring their performance. The weaker notions of predictive inference and coverage here, we might hope to build more eﬀective and applicable guardrails and uncertainty measures for modern statistical systems, even as they are released to the world.

Appendix A. A general upper bound for the greedy approach

As we saw in section 3, reasonable conditions on label distributions guarantee that the greedy mappings {Cgr η }η (0,1) solve problem (Cond), while pathologies (as in Example 3) exist. In this section, we show that even in general cases, the sizes of the conﬁdence sets Cgr and Ccond cannot be too far apart. We motivate our approach by noting the similarity between problem (Cond) and the minimum set cover problem familiar in submodular optimization Vazirani (2001); Golovin et al. (2014), which we recall. Let f : 2Y [0, 1] be a monotone submodular coverage function, meaning that for each A B Y and y Y \ B, f satisﬁes f(A) f(B), f(A {y}) f(A) f(B {y}) f(B), f(Y) = 1, and f( ) = 0. A solution to the minimum set cover problem is

C η argmin C Y {|C| s.t. f(C) η} . (19)

A classical result combinatorial optimization of Wolsey (1982) bounds the size of the set that a natural greedy algorithm for problem (19) returns. To state the result, we introduce a bit of notation. For any set C Y and y Y, we deﬁne

(C, y) := f(C {y}) f(C),

Predictive Inference with Weak Supervision

increase in coverage from adding y to C. At each step j [K], the greedy algorithm chooses

yj := argmax y Y ({y1 . . . , yj 1}, y),

and stops at the ﬁrst step j(η) K such that f({y1, . . . , yj(η)}) η. For the greedy set Cgr,j := {y1, . . . , yj}, deﬁne the constant

Kf,η := min

( η η f(Cgr,j(η) 1),

max y Y, j j(η) (Cgr,j,y)>0

( , y) (Cgr,j, y)

, maxy Y ( , y) maxy Y\Cgr,j(η) 1 (Cgr,j(η) 1, y)

We then have the following result.

Lemma 22 (Wolsey (1982), Theorem 1) Let f : 2Y [0, 1] be a monotone submodular coverage function. Then

|Cgr,j(η)| 1 + log Kf,η |C η|

Given the apparent similarity between the problems (19) and (Cond), we would like to leverage Lemma 22 to establish a similar guarantee for Alg. 2. To apply Lemma 22 to Alg. 2, we provide the natural analogous quantities, leveraging the notation in the algorithm and working conditional on X = x. Deﬁne fx(C) := P(W C = | X = x), which is immediately a submodular coverage function, and for each x we have increment function x(C, y) = P(W C = , y W | X = x). Because the greedy sets Cgr η (x, u) may be randomized but always satisfy Cgr η (x, 1) Cgr η (x, 0), we provide a slight alternative to the constant Kf,η, deﬁning

KP,η,x := min

( η η P(W Cgr η (x, 1) = | X = x),

max y Y, j j(x,η) x(Cgr,j(x),y)>0

x( , y) x(Cgr,j(x), y)

, maxy Y ( , y) maxy Y (Cgr η (x, 1), y)

Invoking Lemma 22 and simplifying gives the following result, which bounds the size of the greedy set by a logarithmic quantity times the size of the best (deterministic) covering set.

Corollary 23 Let Cgr η : X [0, 1] Y be the conﬁdence set mapping Algorithm 2 outputs. Then for all x X and u [0, 1],

|Cgr η (x, u)| |Cgr η (x, 0)| 1 + log KP,η,x min C Y {|C| s.t. P(W C = | x) η} .

We can roughly interpret the three terms inside the minimum in (20) as follows. The ﬁrst term is large when the greedy algorithm nearly attains the required coverage on the iteration

Cauchois, Gupta, Ali, and Duchi

just before terminating, and therefore measures (in a sense) how wasteful the algorithm is. The second term is large when choosing a label earlier would have improved the coverage more, and so expresses a kind of regret. The third term measures how often the labels y Y co-occur in W. Though the bound is a functional of the discrete derivative x(C, y) and small when the local information in x(C, y) gives good indicators of globally optimal sets C, it can be hard to compute explicitly; we therefore evaluate the size of the sets that Alg. 2 generates for a few experimental examples in Section 5.

Appendix B. Proofs of mathematical results

B.1 Proofs of lower bounds on conﬁdence set sizes

B.1.1 Proof of Theorem 3

Suppose that P is a consistent distribution on X Y 2Y, with marginal Pweak over X 2Y, and consider a procedure b Cn oﬀering 1 α strong distribution-free coverage. Let P be the distribution on X Y 2Y with Pweak = Pweak, and where we deﬁne P by the triple ( X, Y , W) P according to

Y = argmin y W

pn( X, y) := P(Xi,W)n i=1 iid Pweak

h y b Cn( X) i .

Then, P is a consistent distribution on X Y 2Y, which ensures that

P(Xi,Yi,Wi)n+1 i=1 iid P

h Yn+1 b Cn(Xn+1) i 1 α.

By deﬁnition of P, we have

P(Xi,Yi,Wi)n+1 i=1 iid P

h Yn+1 b Cn(Xn+1) i = E(Xn+1,Yn+1,Wn+1) W [pn(Xn+1, Yn+1)] ,

the law of b Cn is identical under P or P, as it only depends on (Xi, Wi)n i=1 iid Pweak. On the other hand, we observe that when (Xn+1, Yn+1, Wn+1) P,

pn(Xn+1, Yn+1) = inf y Wn+1 pn(Xn+1, y),

which guarantees that

1 α E(Xn+1,Yn+1,Wn+1) P [pn(Xn+1, Yn+1)]

= E(Xn+1,Yn+1,Wn+1) P

inf y Wn+1 pn(Xn+1, y)

= E(Xn+1,Wn+1) Pweak

inf y Wn+1 pn(Xn+1, y)

= E(Xn+1,Wn+1) Pweak

inf y Wn+1 pn(Xn+1, y) .

Predictive Inference with Weak Supervision

B.1.2 Proof of Corollary 8

Consider (X, W) Pweak independent of (Xi, Wi)i 1, and deﬁne p(x, y) = P(y C(x)), recalling the deﬁnition pn(x, y) = P(y b Cn(x)). Then because for any set W Y and any functions f and g we have inf y W f(y) inf y W g(y) µ(W) inf y W |f(y) g(y)|µ(W) Z

W |f(y) g(y)|dµ(y),

we obtain by Jensen s inequality and Fubini s theorem that

E inf y W p(X, y) inf y W pn(X, y) µ(W) E Z

Y |p(X, y) pn(X, y)| dµ(y)

1{y b Cn(X)} 1{y C(X)} dµ(y) = E h µ b Cn(X) C(X) i .

Taking the limit as E[µ( b Cn(X) C(X))] 0 as n we thus have

lim n E inf y W p(X, y) inf y W pn(X, y) µ(W) = 0.

By monotonicity and the assumption that µ(W) > 0 with probability 1, for any ϵ > 0 we may choose c > 0 such that P(µ(W) < c) ϵ, and thus

lim n P inf y W p(X, y) inf y W pn(X, y) ϵ

lim n P inf y W p(X, y) inf y W pn(X, y) ϵ, µ(W) c + P(µ(W) < c) 0 + ϵ.

In particular, | infy W p(X, y) infy W pn(X, y)| p 0, and as pn and p both take values in [0, 1], Theorem 3 s conclusion that E[infy W pn(X, y)] 1 α implies

E inf y W p(X, y) 1 α.

As infy W p(X, y) [0, 1], this in turn implies P(infy W p(X, y) > 0) 1 α. Finally, we note the following equivalence: a target y W Det C(x)\C(x) if and only if y W {y Y : p(x, y) = 0} if and only if infy W p(x, y) = 0. That is, we have the event equalities

{W Det C(x) C(x)} = inf y W p(x, y) > 0 ,

so that P(W Det C(X) C(X)) = P(infy W p(X, y) > 0) 1 α.

B.2 Proofs on size set optimality in weak supervision

B.2.1 Proof of Lemma 12

Fix η0 (0, 1). Then y Cη0(x, u) implies that snest(x, y, u) = inf{η | y Cη(x, u)} η0 and so snest(x, y, u) η0. Conversely, assume that snest(x, y, u) η0. Then by deﬁnition of snest, y Cη0(x, u) if and only if for all η > η0, we have y Cη(x, u) but y Cη0(x, u), and therefore snest(x, y, u) = η0. But of course, by continuity, P(snest(x, y, U) = η0) = 0, and so

P(snest(x, y, U) η0 and y Cη0(x, U)) = 0.

Cauchois, Gupta, Ali, and Duchi

B.2.2 Proof of Proposition 17

The case where W | X = x has a label-independent structure is immediate, hence we focus on proving the result when W | X = x has a label tree-structure (8). We prove the result by induction on the size of Y , observing that the result is immediate if |Y | = 1. If |Y | = K > 1, we assume that the result holds on sets with at most K 1 elements. We denote Px the law of W | X = x, by Pu the law of U and P = Px Pu their joint distribution, and similarly for their expectations. Fix η (0, 1), and let C : [0, 1] Y be a conﬁdence set mapping satisfying

P(C(U) W = ) η.

We will prove that

Eu|CGreedy η (x, U)| Eu|C(U)|.

We use the label ranking y1(x), . . . , y K(x) that Alg. 2 deﬁnes, omitting x for simplicity, and consider two cases:

Case 1: Pu(y K C(U)) = 0.

Then C provides coverage at level η using only the K 1 ﬁrst labels, which also guarantees that CGreedy η (x, u) only contains labels in {y1, . . . , y K 1} (since, in that case, Jη K 1 in Alg. 2). The induction hypothesis applied to the distribution of W \ {y K} thus ensures that Eu|CGreedy η (x, U)| Eu|C(U)|.

Case 2: Pu(y K C(U)) > 0.

In that case, we will prove that either Pu(yj C(U)) = 1 for all j [K 1], or that we can build a new conﬁdence set Cﬁnal(U) such that

P(Cﬁnal(U) W = | X = x) P(C(U) W = | X = x), Eu|Cﬁnal(x, U)| = Eu|C(U)|,

and veriﬁes either Pu(yj Cﬁnal(U)) = 1 for all j [K 1], or Pu(y K Cﬁnal(U)) = 0.

The distribution Px induces a tree whose leaves are the labels y1, . . . , y K, and each inner node N (apart from the root, which is Y itself) is a subset of Y such that Px(W = N) > 0, and two nodes N1 and N2 share the same parent if any subset C containing strictly either N1 or N2 such that Px(W = C) > 0 contains N1 N2. This parent is then the smallest subset Np such that N1 N2 Np and Px(W = Np) > 0. Each parent is then the union of all its children. Figure 1 provides an example of such a tree.

Deﬁning D(C) := {y Y \ {y K} | Pu(yj C(U)) < 1} = , we consider the element y D(C) sharing the lowest common ancestor with y K in the tree. For instance, in Figure 1, if D(C) = {y3, y4}, then y D = y3, as their common ancestor W0 is lower than the common ancestor of y5 and y4 (Y itself).

We then proceed to deﬁne C(U) from C(U) so that

C(U) \ {y K, y D} = C(U) \ {y K, y D} (21)

Predictive Inference with Weak Supervision

Eu| C(U) {y K, y D}| = Eu|C(U) {y K, y D}|, (22)

but now either

Pu(y D C(U)) = 1 or Pu(y K C(U)) = 0.

In practice, we do so by replacing y K by y D when C(U) {y K, y D} = {y K(x)}, or/and decreasing the probabilities that C(U) {y K, y D} = {y D, y K} and C(U) {y K, y D} = , in such a way that the average size does not vary, but the probability that C(U) {y K, y D} = {y D} increases.

We then proceed to check that such a change cannot hurt our coverage, while it evidently leaves the average conﬁdence set size unchanged (because of equations (21) and (22)).

The only way we can have C(U) W = and C(U) W = is if W = {y K}. On the other hand, when W = {y D}, we can have C(U) W = but C(U) W = . Because of the deﬁnition of y D with respect to y K, any other value of W such that C(U) W = will be such that C(U) W = . In particular, by independence of W and U, we have

P( C(U) W = ) P(C(U) W = )

Px(W = {y K(x)}) h Pu(y K C(U)) Pu(y K C(U)) i

+ Px(W = {y D}) h Pu(y D C(U)) Pu(y D C(U)) i

= Pu(y D C(U)) Pu(y D C(U)) (Px(W = {y D}) Px(W = {y K})) ,

since Pu(y K C(U)) + P(y D C(U)) = Pu(y K C(U)) + P(y D C(U)), as the total average size does not vary.

In addition, since y K gets selected last in Alg. 2, we know that for all y Y ,

Px (W = {y D}) Px (W = {y K})) ,

which achieves to prove that

P( C(U) W = ) P(C(U) W = ) η.

If Pu(y D C(U)) = 1, then |D( C)| |D(C)| 1, and we can repeat the process until we obtain a ﬁnal mapping Cﬁnal such that either D(Cﬁnal) = or P(y K(x) Cﬁnal(U)) = 0.

In the ﬁrst scenario where eventually D(Cﬁnal) = , it means that Cﬁnal(U) is either Y or Y \ {y K}, and this is immediate to check that since P(Cﬁnal(U) W = | X = x) η, we must have Pu(y K Cﬁnal(U)) Pu(y K CCond-Prox η (x, U)), which in turn ensures that

Eu|CCond-Prox η (x, U)| Eu|Cﬁnal(U)| = E|C(U)|.

Cauchois, Gupta, Ali, and Duchi

In the second case, since PU(y K Cﬁnal(U)) = 0, Cﬁnal is eﬀectively a conﬁdence set over strictly less than K labels, in which case we can apply the induction hypothesis to conclude that

Eu|CCond-Prox η (x, U)| Eu|Cﬁnal(U)| = Eu|C(U)|.

B.3 Proofs of algorithmic validity

B.3.1 Proof of Lemma 21

We prove the result by proving that if we run Algorithm 3 with M = K!, deﬁning at each step ym j,2 as in equation (11), then at each step m K! of the algorithm, {Ym j }j [m] is a valid partition of Y that satisﬁes the following conditions:

1. For each j [m], we have Ym j = {ym j } if and only if there exists no i [K 1] such that (i + 1 i) ym j Ym j .

2. If Ym j = {ym j } then s(x, ym j ) s(x, ym j,2).

If the partition satisﬁes these two conditions at every step, then we can run the algorithm until step m = K!, at which point it returns a partition {YK! j }K! k=1 such that s(x, y K! 1 ) s(x, y K! K!). Now, since we have sm j = s K! j for every 1 j m, we conclude that, at each step m [K!], we have

s(x, ym 1 ) s(x, ym 2 ) s(x, ym m) min y Y\{ym 1 ,...,ym m} s(x, y),

which proves the validity of the algorithm. This is of course true for m = 1, since the best conﬁguration simply ranks rk(x) in decreasing order.

1. By deﬁnition of ym ind(m),2 and yind(m), {Ym+1 j }m+1 j=1 is a valid partition of Y such that

each ym+1 j Ym+1 j , if {Ym j } is itself a valid partition (and m < K!), so long as we can prove that if Ym j = {ym j }, then there must exist α [K 1] such that (α + 1 α) ym j Ym j (i.e the algorithm does not get stuck and terminates). But this is immediate as, if for all α [K 1], we have (α + 1 α) ym j / Ym j , then it must be by construction that

α [K 1] {y Y | y 1(ym j (α)) < y 1(ym j (α + 1))} = {ym j }.

Therefore, at each step m K! of the algorithm, {Ym j }j [m] is a valid partition of Y, and the algorithm terminates.

2. On the other hand, it requires more care to justify why, if we set, for all j m,

ym j,2 := argmin y Ym j {s(x, y) | α [K 1], y = (α + 1 α) ym j }

Predictive Inference with Weak Supervision

then we should always have

s(x, ym j ) s(x, ym j,2) (23)

for all j [m], i.e. why any permutation of the form (α α + 1) ym j that belongs to Ym j cannot strictly decrease the score s(x, y).

Equation (23) actually results from a crucial property of the score function (10), which ensures that if s(x, y) < s(x, (α α + 1) y), then it must hold that

ry(α)(x) < ry(α+1)(x),

i.e. the elements y(α) and y(α + 1) were originally in the wrong order in y.

But, since we start the partition process with y1 1 such that ry1 1(1)(x) ry1 1(k)(x), i.e with all elements in the correct order, it is straightforward to check that at any time m, there cannot exist a permutation of the form (α α + 1) ym j that also belongs to Ym j such that

rym j (α)(x) < rym j (α+1)(x) :

if that were the case, then there would exist l j such that ym j (α) and ym j (α + 1) were the elements exchanged at time l when creating the partition {Yl+1 i }l+1 i=1. In turn, since ym j Ym j , this would guarantee that

Ym j {y Y | y 1(ym j (α)) < y 1(ym j (α + 1))},

and thus that (α α + 1) ym j / Ym j . This guarantees that any conﬁguration (α α + 1) ym j Ym j satisﬁes

s(x, (α α + 1) ym j ) s(x, ym j ),

and thus that either Ym j = {ym j }, or

s(x, ym j,2) s(x, yj),

which concludes the proof.

Appendix C. Further experiments

C.1 Structured prediction example (Perfect matching scores and weak supervision)

A matching task consists of optimally pairing elements of a bipartite graph given a feature vector x X. For example, one may wish to identify paired amino acids in protein folding (Taskar et al., 2003). We assume there exists a bipartite graph G with disjoint sets U and V of K 1 nodes; each label Y is then a perfect matching between U and V , i.e., a bijection Y Y = S(U, V ). General supervised approaches for perfect matching problems, such as structured Support Vector Machines (Tsochantaridis et al., 2004) or Adversarial Bipartite Matching (Fathony et al., 2018), generally learn pairwise score functions

Cauchois, Gupta, Ali, and Duchi

ϕu,v : X R for all (u, v) U V , which measure the cost of adding the edge e := (u, v) for a feature vector x X, and then output a prediction

argmin y S(U,V )

u U ϕu,y(u)(x) = X

u U,v V 1{v = y(u)} ϕu,v(x)

an instance of minimum cost perfect matching solvable in time O(K3) with the Hungarian algorithm. To eﬃciently adapt this approach in the context of Alg. 1, we assume we have trained a set of pairwise score functions {ϕu,v : X R | (u, v) U V }, (e.g. using supervised training data) and wish to conformalize with partially supervised data, using the score function

s Matching(x, y) := X

u U,v V 1{v = y(u)} ϕu,v(x, y). (24)

In a matching problem, weak supervision can arise under the form of a partial matching between subsets Ui U and Vi V of the nodes, which we write Y weak i S(Ui, Vi): each u Ui has a matching element Y weak i (u) = Yi(u) Vi. Computing the minimum partial scores (5) in Alg. 1 is then computationally eﬃcient, as it reduces to yet another minimum cost perfect matching problem:

u Ui ϕu,Y weak i (u)(x) + min y S(U\Ui,V \Vi)

u U\Ui v V \Vi

1{v = y(u)} ϕu,v(x)

In the matching case, Alg. 3 is equivalent to ﬁnding the M-best minimal weight perfect matching in a bipartite graph, which Chegireddy and Hamacher (1987) eﬃciently solve. In the context of Alg. 3, their approach iteratively chooses an edge em ym ind(m)\ym ind(m),2, then partitions the set of matchings M Ym ind(m) depending on whether em M or not. The computation of each second-best conﬁguration then amounts to solving at most K diﬀerent perfect matching problems, resulting in an overall O(MK4) cost of the procedure.

C.2 Pedestrian tracking with partial matching information

We now apply our weakly supervised conformal methods to a bipartite matching problem. A common objective in computer vision, relevant for instance for self-driving cars, is to track people s trajectory throughout diﬀerent time frames. Since we can leverage powerful algorithms (Redmon et al., 2016) to individually detect objects in every single frame, the problem that we study here is actually a matching task where the goal is to match two sets of people appearing in two separate frames: this is an instance of a maximal matching problem.

Weak supervision with partial matchings In this context, we expect partial supervision to come under the form of a partial matching: some people, e.g. people that are easier to track between two consecutive frames because they are in the foreground, already have their match in the second frame, when others, perhaps more diﬃcult to track, are still waiting for a potential match. Given these partially labeled instances, the goal then

Predictive Inference with Weak Supervision

Strong Weak

ETH B vs ETH P/ETH S

ETH P vs ETH B/ETH S

ETH S vs ETH B/ETH P

ETH B vs ETH P/ETH S

ETH P vs ETH B/ETH S

ETH S vs ETH B/ETH P

ETH B vs ETH P/ETH S

ETH P vs ETH B/ETH S

ETH S vs ETH B/ETH P

Train/Validation split Train/Validation split

P(Y b C(X)) P(W b C(X) = ) A B

Figure 6. Results for the video tracking matching dataset MOT2015 Leal-Taix e et al. (2015), over Ntrials = 20 trials. We use one video sequence for training (ETH-B, ETH-P or ETH-S) and the two others for calibration and testing. A: Strong (1) and weak (3) coverage. B: Average conﬁdence set size for fully supervised (FSC) and weakly supervised conformal (WSC) methods.

becomes to return conﬁdence sets of matchings that guarantee 1 α coverage: to provide valid weak coverage (3), we wish to include a conﬁguration that contains at least all the partial matches.

Predicting trajectories in the MOT2D15 data set (Leal-Taix e et al., 2015) We experiment using the MOT2D15 pedestrian video tracking dataset (Leal-Taix e et al., 2015). This public benchmark contains short street videos with pedestrians, and the goal is to track each of them while they appear in the frame. Each frame has a set of bounding boxes corresponding to each individual present in the frame, and we seek to match boxes representing the same person between two consecutive frames. Since an individual can enter or exit the frame between two consecutive images, we need to account for potentially unmatched boxes, which we do by including virtual nodes in the bipartite graph, similarly to previous approaches (Kim et al., 2013; Fathony et al., 2018). We use the same feature representation of Kim et al. (2013) and Fathony et al. (2018): given a pair x := (x1, x2) of two consecutive images, and two bounding boxes u x1, v x2, we compute a d = 46 dimensional vector φ(x1, x2, u, v) that summarizes key features (e.g. position of the bounding box, color distribution) and allows determining whether u and v represent the same person. We then train our model using a structured S-SVM approach (Tsochantaridis et al., 2004), following Kim et al. (2013). The model outputs a pairwise score function s PW : (x, u, v) 7 θT φ(x1, x2, u, v) for some vector θ Rd where the feature vector x = (x1, x2) X consists of two consecutive frames, and (u, v) are two potential bounding boxes (one in each image).

Experimental set-up and partial labels We use MOT2D15 as follows. For each of the ETH-Bahnhof, ETH-Pedcross2 and ETH-Sunnyday video sequences, which contain respectively 1000, 837, and 354 consecutive images, we select one for training, one for calibration, and the last for testing, using α = 0.02 for conformalization purposes. We introduce weak supervision by assuming that for each pair of images, we observe a partial matching: among the Ki paired individuals, a user provides feedback on Kpartial i iid 1 + Poi(0.5) matches.

Cauchois, Gupta, Ali, and Duchi

For both our FSC and WSC methods, we use a translated version of the s Matching

score (24) with pairwise functions su,v(x) := s PW(x, u, v): for each instance (x, y) X Y, we use the score

s Matching(x, y) := s Matching(x, y) min y Y s Matching(x, y)

This operation ensures that miny Y s Matching(x, y) = 0 for every x X and thus does not change the ordering of conﬁgurations or the score diﬀerence between two conﬁgurations; we simply use it to place all the instances Xi on the same scale when applying Algorithm 1. In particular, in the noiseless case where the true label Yi is always the minimizer of y 7 s Matching(Xi, y), this guarantees that b C(X) eventually contains a single conﬁguration (as we should, since the score function in this case outputs perfect predictions).

Experimental results This speciﬁc problem is actually low-noise, as it is possible to achieve a very high accuracy with the S-SVM approach, which is not so surprising as we assume perfect detection of every person thanks the bounding boxes. As a result, we can expect the FSC and WSC methods to output very similar conﬁdence sets, as the conﬁguration minimizing the score is often the true label itself. This is precisely what we observe in Figure 6, where both methods are actually indistinguishable and where, even with a very high conﬁdence 1 α. = 0.98, both the FSC and WSC methods return conﬁdence sets with a single conﬁguration on average. We only notice a slight diﬀerence between both methods when training on the ETH-Sunnyday sequence, which contains fewer images, and hence produces slightly worse score functions.

C.3 Prediction intervals for weakly supervised regression

Much of our development goes beyond (ﬁnite) spaces with combinatorial structure. We therefore ﬁnish with an exemplar regression problem. We consider predicting the fraction of votes in each United States county for the Democratic Party candidate in the 2016 United States presidential election, using demographic (census) data as covariates and the results of past elections. It is common during elections for forecasters to build predictive models from both census and historical election data, as well as current polling data. We view the historical data as strong supervision (it tells us exactly how many people voted for each candidate), and the polling data as weak supervision (as polls always come with a margin of error). Our goal here is to ﬁt a regression model to the strongly supervised past election data, and then form prediction intervals for the fraction of people in each county who voted Democrat by leveraging the weakly supervised polling data. We hope by combining both we obtain valid intervals narrower than the polling margins of error. Our data comes from the 2013 2017 American Community Survey 5-Year Estimates, a longitudinal survey that records demographic information about each of the 3220 United States counties. We use d = 34 of the available demographic features, and the response is the fraction of people in each county who voted Democratic. To experiment with this dataset, we split it into thirds: 33% of the counties (and their associated fractions of Democratic voters) go into the training set, 33% go into the calibration set, and the rest go into the test set; as our splits are random, they are exchangeable. We ﬁt a Beta regression model to the strongly supervised training data. To simulate the availability of weakly supervised polling

Predictive Inference with Weak Supervision

data, we replace each calibration set response Yi [0, 1] with a weak response Wi [0, 1], i = 1, . . . , n, by forming intervals

Wi = [Yi Zi, Yi + Zi], Zi N(µ, 0.0001), i = 1, . . . , n, (25)

for various values of µ {.01, .05, .1, .15, .2}, so that Wi captures ﬂuctuations of roughly µ around Yi. We conformalize by running Alg. 1 with the absolute error score

s(x, y) = |ˆy(x) y|,

where ˆy(x) [0, 1] denotes the Beta regression model s prediction for the point x Rd. Here, conformalization boils down to solving the simple linear program

s(Xi, Yi) = min γ R {|ˆy(Xi) γ| | Yi Zi γ, γ Yi + Zi} , i = 1, . . . , n.

Finally, we evaluate both strong (1) and weak (3) coverage on the test set, applying the same transformation (25) to generate the weak labels for the test set. We compute the two types of coverage, as well as the lengths of the prediction intervals, by repeating this process 20 times. We set the miscoverage level α = .05. Similar to our other experiments, Alg. 1 achieves weak coverage at the nominal level .95 for all values of µ (governing the amount of weak supervision), as in the middle panel of Figure 7. We expect the strong coverage to be lower. The left panel of Figure 7 uses compares the strong coverage Alg. 1 achieves in teal, showing the coverage of standard conformal inference (using the correct responses Yi) in pink. Because it provides weak coverage, we expect Alg. 1 to generate shorter prediction intervals than standard conformal inference. The right panel of Figure 7 exhibits this: when µ .1, the average length of Alg. 1 s intervals is at least three times smaller than standard conformal s, and half the length of the average weakly supervised interval Wi from (25) ( .2). We can also see from these ﬁgures, as in our other experiments, that Alg. 1 s strong coverage degrades as µ grows, whereas its weak coverage improves and the length of its prediction intervals shrinks. We view these results from a slightly diﬀerent perspective in Figures 8 and 9. In Figure 8, we show the true fraction of Democratic votes in each county in the test. In Figure 9, we show the lower and upper endpoints of Alg. 1 s prediction intervals, for a randomly chosen repetition with µ = .05. In these two ﬁgures, we color the counties with strong (predicted) Democratic majorities blue, and those with strong (predicted) Republican majorities red. By comparing the colors, we can see that the prediction intervals only sometimes contain the true response, which is expected. Finally, we note that the colors of the lower and upper endpoints in Figure 9 are similar, because the length of the prediction intervals is usually small.

Cauchois, Gupta, Ali, and Duchi

0.01 0.05 0.1 0.15 0.2

0.01 0.05 0.1 0.15 0.2

0.01 0.05 0.1 0.15 0.2 mu

Average Interval Size

Figure 7. Results for the regression problem with the election data over 20 trials. Left panel: strong coverage (1). Middle panel weak coverage (3). The dashed red line indicates the nominal coverage level, 1 α = .95. Right panel: prediction interval lengths. In these plots, we show Alg. 1, denoted WSC , in teal. We show standard conformal inference, denoted FSC , in pink.

Figure 8. Map of United States counties. We color each county according to the actual fraction votes for the Democratic candidate in the 2016 United States presidential election. We color counties with strong Democratic majorities blue, and those with strong Republican majorities red. We color the counties from the training and calibration sets gray.

Predictive Inference with Weak Supervision

Figure 9. Map of United States counties. We color each county according to the value of the upper (top panel) and lower (bottom panel) endpoints of the conﬁdence interval that Alg. 1 returns, when µ = .05. We color counties with values close to 1 blue, and those with values close to 0 red. We color the counties from the training and calibration sets gray.

Cauchois, Gupta, Ali, and Duchi

N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: Ranking and clustering. Journal of the Association for Computing Machinery, 55(5), 2008.

A. Angelopoulos, S. Bates, J. Malik, and M. I. Jordan. Uncertainty sets for image classiﬁers using conformal prediction. ar Xiv:2009.14193 [cs.CV], 2020.

R. F. Barber, E. J. Cand es, A. Ramdas, and R. J. Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference, 10(2):455 482, 2021.

S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. I. Jordan. Distribution-free, riskcontrolling prediction sets. Journal of the Association for Computing Machinery, 68(6): 43:1 43:34, 2021.

M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classiﬁcation. Pattern Recognition, 37(9):1757 1771, 2004.

V. Cabannes, A. Rudi, and F. Bach. Structured prediction with partial labelling through the inﬁmum loss. In Proceedings of the 37th International Conference on Machine Learning, 2020.

Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129 136, 2007.

M. Cauchois, S. Gupta, A. Ali, and J. Duchi. Robust validation: Conﬁdent predictions even when distributions shift. ar Xiv:2008.04267 [stat.ML], 2020.

M. Cauchois, S. Gupta, and J. Duchi. Knowing what you know: valid and validated conﬁdence sets in multiclass and multilabel prediction. Journal of Machine Learning Research, 22(81):1 42, 2021.

C. R. Chegireddy and H. W. Hamacher. Algorithms for ﬁnding k-best perfect matchings. Discrete Applied Mathematics, 18(2):155 165, 1987.

J. Cid-Sueiro, D. Garc ıa-Garc ıa, and R. Santos-Rodr ıguez. Consistency of losses for learning from weak labels. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 197 210, 2014.

T. Cour, B. Sapp, and B. Taskar. Learning from partial labels. Journal of Machine Learning Research, 12:1501 -1536, 2011.

J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Image Net: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255, 2009.

D. L. Donoho. 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745 766, 2017.

Predictive Inference with Weak Supervision

J. C. Duchi, L. Mackey, and M. I. Jordan. The asymptotics of ranking algorithms. Annals of Statistics, 41(5):2292 2323, 2013.

B.-S. Einbinder, Y. Romano, M. Sesia, and Y. Zhou. Training uncertainty-aware classiﬁers with conformalized deep learning. In Advances in Neural Information Processing Systems 35, 2022.

A. Elisseeﬀand J. Weston. A kernel method for multi-labeled classiﬁcation. In Advances in Neural Information Processing Systems 14, 2001.

R. Fathony, S. Behpour, X. Zhang, and B. Ziebart. Eﬃcient and consistent adversarial bipartite matching. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1457 1466, 2018.

E. R. Fernandes and U. Brefeld. Learning from partially annotated sequences. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 407 422, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

A. Fisch, T. Schuster, T. Jaakkola, and R. Barzilay. Eﬃcient conformal prediction via cascaded inference with expanded admission. In Proceedings of the Ninth International Conference on Learning Representations, 2021.

Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. Eﬃcient boosting algorithms for combining preferences. Journal of Machine Learning Research, 4:933 969, 2003.

S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. Dimakis, J. Jitsev, Y. Carmon, V. Shankar, and L. Schmidt. Data Comp: In search of the next generation of multimodal datasets. In Advances in Neural Information Processing Systems 36, 2023.

D. Golovin, A. Krause, and M. Streeter. Online submodular maximization under a matroid constraint with application to learning assignments. ar Xiv:1407.1082 [cs.LG], 2014.

C. Gupta, A. K. Kuchibhotla, and A. K. Ramdas. Nested conformal prediction and quantile out-of-bag ensemble methods. Pattern Recognition, 127, 2022. Special Issue on Conformal and Probabilistic Prediction with Applications.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016.

Y. Hechtlinger, B. P oczos, and L. Wasserman. Cautious deep learning. ar Xiv:1805.09460 [stat.ML], 2019.

E. H ullermeier, J. F urnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artiﬁcial Intelligence, 172(16 17):1897 -1916, Nov. 2008.

Cauchois, Gupta, Ali, and Duchi

J. G. Kemeny. Mathematics without numbers. Daedalus, 88(4):577 591, 1959.

M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81 93, 1938.

S. Kim, S. Kwak, J. Feyereisl, and B. Han. Online multi-target tracking by large margin structured learning. In Proceedings of the Asian Conference on Computer Vision, pages 98 111, 2013.

A. Korba, A. Garcia, and F. d Alch e Buc. A structured prediction approach for label ranking. In Advances in Neural Information Processing Systems 31, pages 8994 9004, 2018.

L. Leal-Taix e, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. ar Xiv:1504.01942 [cs.CV], 2015.

J. Lei. Classiﬁcation with conﬁdence. Biometrika, 101(4):755 769, 2014.

J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society, Series B, 76(1):71 96, 2014.

J. Lei, M. G Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523): 1094 1111, 2018.

S. Mayhew, S. Chaturvedi, C.-T. Tsai, and D. Roth. Named entity recognition with partially annotated training data. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 645 655, 01 2019.

S. Negahban, S. Oh, and D. Shah. Iterative ranking from pair-wise comparisons. Operations Research, 65(1):266 287, 2016.

N. Nguyen and R. Caruana. Classiﬁcation with partial labels. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 551 559, 2008.

G. Papandreou, L. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the International Conference on Computer Vision, pages 1742 1750, 2015.

T. Qin and T. Liu. Introducing LETOR 4.0 datasets. ar Xiv:1306.2597 [cs.IR], 2013.

A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. R e. Snorkel: rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269 282, 2017.

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Uniﬁed, real-time object detection. In Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition, pages 779 788, 2016.

Y. Romano, R. Barber, C. Sabatti, and E. J. Candes. With malice towards none: Assessing uncertainty via equalized coverage. ar Xiv:1908.05428 [stat.ME], 2019a.

Predictive Inference with Weak Supervision

Y. Romano, E. Patterson, and E. J. Cand es. Conformalized quantile regression. In Advances in Neural Information Processing Systems 32, 2019b.

Y. Romano, M. Sesia, and E. J. Cand es. Classiﬁcation with valid and adaptive coverage. In Advances in Neural Information Processing Systems 33, 2020.

M. Sadinle, J. Lei, and L. Wasserman. Least ambiguous set-valued classiﬁers with bounded error levels. Journal of the American Statistical Association, 114(525):223 234, 2019.

D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems 28, 2015.

B. Taskar. Learning Structured Prediction Models: A Large Margin Approach. Ph D thesis, Stanford University, 2005.

B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in Neural Information Processing Systems 16, 2003.

R. J. Tibshirani, R. F. Barber, E. J. Cand es, and A. Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems 32, 2019.

B. Triggs and J. Verbeek. Scene segmentation with crfs learned from partially labeled images. In Advances in Neural Information Processing Systems 21, volume 20, pages 1553 1560, 2008.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.

B. van Rooyen and R. C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18(228):1 50, 2018.

A. van Zuylen, R. Hegde, K. Jain, and D. P. Williamson. Deterministic pivoting algorithms for constrained ranking and clustering problems. In Proceedings of the Eighteenth ACMSIAM Symposium on Discrete Algorithms (SODA), pages 405 414. Society for Industrial and Applied Mathematics, 2007.

V. V. Vazirani. Approximation Algorithms. Springer, 2001.

V. Vovk, A. Grammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, 2005.

L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2:385 393, 1982.

H.-F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 593 601, 2014.

Cauchois, Gupta, Ali, and Duchi

C. Zhang, C. R e, M. Cafarella, C. De Sa, A. Ratner, J. Shin, F. Wang, and S. Wu. Deep Dive: Declarative knowledge base construction. Communications of the ACM, 60(5):93 102, 2017.