# predictive_inference_with_weak_supervision__bf111ad9.pdf Journal of Machine Learning Research 25 (2024) 1-46 Submitted 3/23; Revised 5/24; Published 5/24 Predictive Inference with Weak Supervision Maxime Cauchois maxime.cauchois@gmail.com Department of Statistics Stanford University Stanford, CA 94305-4020, USA Suyash Gupta suyash028@gmail.com Department of Statistics Stanford University Stanford, CA 94305-4020, USA Alnur Ali alnurali@gmail.com Department of Statistics Stanford University Stanford, CA 94305-4020, USA John Duchi jduchi@stanford.edu Department of Statistics and electrical Engineering Stanford University Stanford, CA 94305-4020, USA Editor: Daniel Hsu The expense of acquiring labels in large-scale statistical machine learning makes partially and weakly-labeled data attractive, though it is not always apparent how to leverage such data for model fitting or validation. We present a methodology to bridge the gap between partial supervision and validation, developing a conformal prediction framework to provide valid predictive confidence sets sets that cover a true label with a prescribed probability, independent of the underlying distribution using weakly labeled data. To do so, we introduce a (necessary) new notion of coverage and predictive validity, then develop several application scenarios, providing efficient algorithms for classification and several large-scale structured prediction problems. We corroborate the hypothesis that the new coverage definition allows for tighter and more informative (but valid) confidence sets through several experiments. Keywords: Conformal inference, Confidence sets, Coverage validity, Weak supervision, Partial labels 1 Introduction Consider the typical supervised learning pipeline that we teach students in statistical machine learning: we collect data in (X, Y ) pairs, where Y is a label or target to be predicted; we pick a model and loss measuring the fidelity of the model to observed data; we choose the model minimizing the loss and validate it on held-out data. This picture obscures what is becoming one of the major challenges in this endeavor: that of actually collecting highquality labeled data (Sculley et al., 2015; Donoho, 2017; Ratner et al., 2017; Gadre et al., 2024 Maxime Cauchois, Suyash Gupta, Alnur Ali and John Duchi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0253.html. Cauchois, Gupta, Ali, and Duchi 2023). Hand labeling large-scale training sets is often impractically expensive. Consider, as simple motivation, a ranking problem: a prediction is an ordered list of a set of items, yet available feedback is likely to be incomplete and partial, such as a top element (for example, in web search a user clicks on a single preferred link, or in a grocery, an individual buys one kind of milk but provides no feedback on the other brands present). Developing methods to leverage such partial and weak feedback is therefore becoming a major focus, and researchers have developed methods to transform weak and noisy labels into a dataset with strong, gold-standard labels (Ratner et al., 2017; Zhang et al., 2017). In this paper, we adopt this weakly labeled setting, but instead of considering model fitting and the construction of strong labels, we focus on validation, model confidence, and predictive inference, moving beyond point predictions and single labels. Our goal is to develop methods to rigorously quantify the confidence a practitioner should have in a model given only weak labels. First consider the standard supervised learning scenario for data (X, Y ) X Y: here, given a desired confidence level α, the goal, rather than to provide point estimates b Y of Y given X, is to give a confidence set mapping b Cn based on (Xi, Yi)n i=1 that guarantees the distribution-free coverage P h Yn+1 b Cn(Xn+1) i 1 α, (1) where (Xn+1, Yn+1) is a new observation following the same distribution as the first n points. Conformal inference provides precisely these guarantees (Vovk et al., 2005; Lei, 2014; Lei and Wasserman, 2014; Lei et al., 2018; Barber et al., 2021). There are many scenarios, however, where it is natural to transition away from this strongly supervised setting with fully labeled examples. Above we note ranking: individuals are very unlikely to provide full feedback (Ailon et al., 2008; Duchi et al., 2013; Negahban et al., 2016). In multi-label image classification (Boutell et al., 2004; Elisseeffand Weston, 2001), a labeler may identify a few items in a given scene but not all, leading to partial labeled feedback. A major challenge in industrial machine learning deployment is to monitor models once they are in production, where it may be challenging to collect high-quality labels, but weak supervision in the form of clicks on a recommended website, or agreeing to a suggested text message completion is relatively easy and cheap to collect. In all of these, developing valid confidence sets and measures for our predictions is of growing importance, as we wish for models to be trustable, usable, and verifiable. With this as motivation, we consider supervised learning problems where, instead of directly observing the ground truth labels {Yi}n i=1, we observe only noisy partial labeling. We make this formal in two equivalent ways. In the first, for each instance i [n], there exists a (random) function ϕi : Y Yweak belonging to a set Φ {Y Yweak} of partially supervising (or measurement) functions, such that we only observe Y weak i = ϕi(Yi). Equivalently, the pair (Y weak i , ϕi) specifies a weak set Wi Y that contains Yi: Wi := n y Y | ϕi(y) = Y weak i o Y, (2) so that we observe a set Wi consistent with Yi. Instead of strong labels (Xi, Yi) iid P, we thus observe only (Xi, ϕi, Y weak i )n i=1. Two running examples with ranking and multi-label problems will help to illustrate our setting. Predictive Inference with Weak Supervision Example 1 (Ranking): In a ranking problem, the goal is to rank elements of a set of K items, indentified via [K] = {1, . . . , K}, where strong labels y SK, the set of permutations of K elements. The strong label y then specifies item at each rank j, so that y(1) [K] is ranked first. Two natural forms of weak labeling include Top-1 feedback, where a response consists of the item ranked first; when item j has the first rank, this corresponds to the set W = {y SK | y(1) = j} of permutations with j in the first position, so that Y weak = ϕ(Y ) = Y (1), and card(W) = (K 1)! Pairwise comparison feedback, so that for a pair of items j1, j2 [K] specified in ϕ, ϕ(y) = 1 y 1(j1) < y 1(j2) , indicating whether the order y ranks j1 ahead of j2; the set of weak labels W = {y | y 1(j1) < y 1(j2)} thus satisfies and card(W) = k 1 2 (k 2)! = (k 1)(k 1)! Note the duality between the pairs (Y weak, ϕ) and W; working with one or the other is frequently more convenient. 3 Example 2 (Multilabel object recognition): In a multilabel object recognition problem, there are K objects of interest, and on an input image x, the strong label y {0, 1}K indicates which objects appear in the image. A labeler may choose (or recognize) only a subset I [K] of the objects, so that the (random) measurement ϕ(y) {0, 1}K satisfies ϕ(y)j = yj if j I and ϕj(y) = 0 otherwise, that is, Y weak j = Yj if j I and Y weak j = 0 otherwise. In this case, we may represent W as the set of elementwise larger vectors W = {y {0, 1}K | yj Yj for j I}, which has cardinality card(W) = 2K |I|. 3 Throughout, ϕi is a random preference function describing the parts of the ground truth label we observe in the partially labeled dataset. It also captures the information about the ground truth label that matter at test time. In the context of ranking (Ex. 1), this means that an individual cares only that their top-ranked item is first, or that the ranking orders a particular subset of items correctly. Our key assumption is that the partial feedback acquisition distribution and distribution of future measurement functions ϕ coincide, so that providing a label y Y that maps to the weak label, i.e., ϕn+1(y) = Y weak n+1 , is correct. The ranking example 1 makes clear that this assumption is plausible, as an individual presumably is more likely to both provide feedback and care about the elements at the top of their rankings; other domains are similar. Finally, without loss of generality, one can always assume that Yweak = 2Y, as any preference function implicitly maps each element y Y to a subset of Y containing y; see equation (2). We consider two fundamental questions in this weakly-labeled setting: (Q.i) When is it possible to provide (distribution-free) coverage, for example, using true labels Y , weak Y weak or sets W, or other measurements? (Q.ii) What methods can guarantee coverage? While a first goal would be to produce a confidence mapping using (Xi, Y weak i , ϕi)n i=1 guaranteeing the coverage (1), as we prove in Section 2.1, this would in general produce large and therefore uninformative confidence sets. We therefore relax our coverage desiderata, instead Cauchois, Gupta, Ali, and Duchi seeking a confidence set b Cn : X Y that covers the weak counterpart Y weak n+1 = ϕn+1(Yn+1) of the true label in the sense that P h b Cn(Xn+1) Wn+1 = i = P h y b Cn(Xn+1) s.t. ϕn+1(y) = Y weak n+1 i 1 α. (3) The direct application of conventional conformal inference methods (Vovk et al., 2005; Lei, 2014; Barber et al., 2021) to obtain this weak coverage is practically impossible, as the space of weak labels is a priori unknown, is typically quite large. Additionally, feedback in the form of collections of weak label sets is typically unusable; as in Example 1, we wish to provide labels and configurations in the target space Y directly. The condition (3) is weaker than the standard coverage (1), and any confidence set satisfying (1) the former will also satisfy (3), though it allows smaller confidence set sizes. A major challenge is that the function ϕn+1 representing an individual s weak supervision is a priori unknown (e.g., in Example 2 the items in an image a labeler will identify ahead of time). Indeed, if we observe ϕn+1 prior to our prediction, a trivial extension of classical conformal methodology (Vovk et al., 2005; Lei, 2014; Barber et al., 2021) achieves coverage (3): first, construct a valid confidence set mapping b Cn,weak : X Yweak for Y weak n+1 , which as in (1) would guarantee P(Y weak n+1 b Cn,weak(Xn+1)) 1 α. Then define b Cn(x) Y to include a single y Y for each yweak b Cn,weak(x) such that ϕn+1(y) = yweak. Example 1 shows the impossibility of such an approach: we do not know ahead of time if an individual cares only about the top-ranked item or requires a ranking accurate to the 10th item. Given the subtleties of coverage (3), we dedicate Section 2 to question (Q.i) above: what types of coverage are even possible? We devote Sections 3 and 4 to question (Q.ii): the development of methodologies that can guarantee the coverage (3). We first (Sec. 3) provide a general recipe, while in Section 4 we provide more tailored methods for large output spaces, such as those in structured prediction. To provide some initial insights into the methods and potential applications, we provide experiments on several real-world domains; in the main body (Section 5) we investigate ranking, while the appendices (see Appendix C) provide additional examples with structured prediction, matching for pedestrian tracking in videos, and prediction intervals for county-level voting in the United States. 1.1 Related Work An extensive line of work addresses prediction with partially labeled data. The major focus is on strong label recovery under weak supervision, including in multiclass (Cour et al., 2011; Nguyen and Caruana, 2008) and multilabel (Yu et al., 2014) tasks as well as structured prediction problems, such as ranking (H ullermeier et al., 2008; Korba et al., 2018), segmentation (Triggs and Verbeek, 2008; Papandreou et al., 2015), and natural language processing (Fernandes and Brefeld, 2011; Mayhew et al., 2019). More recent work tackles constructing strongly labeled datasets from disparate weak supervision tasks (Ratner et al., 2017; Zhang et al., 2017), while Cid-Sueiro et al. (2014), van Rooyen and Williamson (2018), and Cabannes et al. (2020) provide generic theoretical conditions allowing strong label recovery. Yet this literature focuses primarily on point prediction problems, where a model only returns a single label with the (putative) highest likelihood, in contrast to our confidence-based approach, which provides calibrated uncertainty estimates and guarantees valid confidence sets with virtually no distributional assumptions. Predictive Inference with Weak Supervision Our work also connects to the substantial literature on conformal inference, where the goal is to provide valid predictive confidence sets (1). Vovk et al. (2005) introduce the main techniques that examples are exchangeable, and so essentially can provide p-values for significance of one-another and suggest the simple and generic split-conformal algorithm for building valid confidence sets. Essentially all conformalized confidence sets offer the coverage guarantee (1), so it is of interest to improve various aspects of the mappings b Cn. For example, works focus on improving the precision of these methods and optimizing average confidence set size (Lei et al., 2018; Sadinle et al., 2019; Hechtlinger et al., 2019; Romano et al., 2019a; Angelopoulos et al., 2020), or on bridging the gap with other forms of coverage, like classwise (Sadinle et al., 2019) or conditional (Romano et al., 2019b; Barber et al., 2021; Cauchois et al., 2021; Romano et al., 2020; Cauchois et al., 2020) coverage. Along these lines, Bates et al. (2021) generalize conformal inference to offer error control with respect to loss functions beyond the 0-1 loss (coverage or non-coverage) central to the guarantee (1), taking, as we do, structured prediction problems as motivation. Bates et al. focus on settings where the loss function naturally reflects the structure of the label space Y, such as hierarchical classification problems where one wishes to label an example X at a resolution (level of the tree) appropriate to the confidence with which it can be labeled. We view our approaches as complementary to theirs: their approaches make sense for scenarios with fully labeled data in which a particular loss function is natural, for example in treestructured hierarchical classification, where a prediction can be made at a given level in the tree. Conversely, our approaches are sensible when one receives weakly supervised data and wishes to make a single good prediction; think of a grocery store deciding which of a large collection of shaving creams to stock, a ranking problem where one wishes to make sure that each individual s desired shaving cream is stocked; in the context of Example 1, Y weak is then the preferred shaving cream (top-1), and the guarantee (3) corresponds to top-1 coverage. In that respect, our approach relates to the expanded admission problem (Fisch et al., 2021), which allows for multiple labels to be admissible , except that we do not observe strongly supervised labels. Consequently, we motivate our distinct coverage guarantees from a set of impossibility results we present in the next section. Additionally, we pay special attention (see Sec. 4) to developing practical algorithms that scale to large label spaces, an important consideration with real-world weak supervision. Notation Throughout this paper, [n] stands for the set {1, 2, . . . , n}. We use C : X Y to denote a set valued mapping C : X 2Y := {W | W Y}. P is either the probability distribution generating the data (X, Y, ϕ) X Y Φ, or equivalently (X, Y, W) X Y 2Y, as both notations are equivalent for our purpose, and U Uni[0, 1] defines a uniform random variable on [0, 1]. S(U, V ) is the set of bijections between two sets U and V , and we use the shorthand SK := S([K], [K]) for permutations; (i, j) is the transposition of elements i and j [K], and for k N, k := {p Rk + | p T 1 = 1} is the space of probability distributions on [k]. 2 Conformal inference with weakly supervised data The starting point of this paper is to delineate realistic goals in weak-conformal inference by determining what is actually possible as we show, a form of weak coverage and what is unachievable. To that end, we demonstrate that strong coverage (1), while desirable, Cauchois, Gupta, Ali, and Duchi may yield large and difficult to interpret prediction sets. For example, as a consequence of Corollary 4 to come, in the ranking example 1, if feedback always consists of a paired comparison (e.g., item j1 preferred to item j2), then strong coverage necessitates a prediction set of size at least order k! (k e)k. We thus relax our goals, presenting a general weak conformal scheme (Section 2.2) that relies on weakly supervised data. 2.1 The strong coverage dilemma with partially supervised data Consider a fully supervised classification setting with feature space X and output space Y, and let Pstrong be a joint distribution on X Y representing strong, as opposed to weak, supervision. In this fully supervised setting, we observe (Xi, Yi) iid Pstrong, in contrast to observing a weak label set W Y satisfying only Y W. We first require definitions of consistency and validity. Definition 1 A probability distribution P on (X, Y, W) X Y 2Y is consistent if P(Y W) = 1. For any consistent distribution P, Pweak and Pstrong denote the marginal distributions of (X, W) and (X, Y ), respectively, when (X, Y, W) P. Definition 2 Let b Cn : X Y be a (potentially randomized) procedure depending only on the weakly supervised sample (Xi, Wi)n i=1 X 2Y. Then b Cn provides (1 α)-strong distri- bution free coverage if for all consistent distributions P on X Y 2Y and (Xi, Yi, Wi)n+1 i=1 iid P, we have coverage (1), i.e., P h Yn+1 b Cn(Xn+1) i 1 α, and b Cn provides (1 α)-weak distribution free coverage (3) if P h Wn+1 b Cn(Xn+1) = i 1 α. With these definitions, we can provide the (negative) result that, on average over the data set, any procedure satisfying strong distribution free coverage (1) must include every individual label y Wn+1 with probability at least 1 α. To formalize this, for a confidence set mapping b Cn : X Y constructed with (Xi, Wi)n i=1, define the function pn(x, y) := P y b Cn(x) , which is the probability, taken over the weakly supervised sample (Xi, Wi)n i=1, that b Cn(x) contains the potential label y. We prove the following theorem in Appendix B.1.1. Theorem 3 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage. Then for all consistent distributions P on X Y 2Y, E{(Xi,Wi)}n+1 i=1 iid Pweak inf y Wn+1 pn(Xn+1, y) 1 α. Predictive Inference with Weak Supervision Theorem 3 essentially states that b Cn simultaneously includes each element y Wn+1 with probability at least 1 α. The theorem is generally not improvable, as Wn+1 need not be a subset of b Cn(Xn+1). Indeed, think of the trivial procedure b Cn that includes every label y Y independently with probability 1 α: it obviously satisfies strong distribution-free coverage but has no connection with Wn+1. As an additional immediate corollary, if the sets W contain at least a fixed number of labels, then so does b Cn(Xn+1). Corollary 4 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage, and that P(|W| L) = 1 for some L 1. Then E{(Xi,Wi)}n+1 i=1 iid Pweak h b Cn(Xn+1) i L(1 α). Proof By Theorem 3, E h b Cn(Xn+1) i = E y Y pn(Xn+1, y) E |Wn+1| inf y Wn+1 pn(Xn+1, y) L(1 α) as claimed. Recalling Example 1, top-item feedback necessitates (k 1)! sets for ranking; multi-label recognition (Ex. 2) similarly necessitates an exponentially large set b Cn. An alternative perspective is to consider large-sample limits; often, the procedure b Cn converges to some population confidence set mapping C : X Y as n , in that E h b Cn(X) C(X) i 0 (4) as n , where the expectation is over both the construction of b Cn and X independent of (Xi, Wi) iid Pweak. Typically, the limiting C is a (nearly) deterministic function1 of x; for example, the standard construction (e.g. Vovk et al., 2005; Lei, 2014; Barber et al., 2021) takes C(x) = {y Y | s(x, y) τ} for some scoring function s : X Y R and threshold τ, which is deterministic. In this case, we can show that we nearly have W C(X), so C(X) must be large whenever W is. To formalize, let Det C(x) := {y Y | P(y C(x)) {0, 1}} be the labels that are deterministically in or out of C(x) (where the probability is over any randomization in the mapping C) so that Det C(x) = Y whenever C is deterministic. Then can show that W C(X) with probability at least 1 α: Corollary 5 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage and satisfies the limit (4). Then P(W Det C(X) C(X)) 1 α. 1. In some cases, we use randomization over a single label to guarantee that P(Y C(X)) = 1 α Cauchois, Gupta, Ali, and Duchi Appendix B.1.2 proves a slightly stronger result (which we state as Corollary 8). Theorem 3 and its corollaries suggest that any procedure achieving strong (distributionfree) coverage necessarily produces inefficient (large) confidence sets when one uses only weakly supervised data. Even in cases where there is implicitly a single correct label, such as the structured prediction problems Cabannes et al. (2020) consider, where the weak labels w that a single x supports (those for which P(W = w | X = x) > 0) have a single label y in their intersection w:P(w|x)>0{w} = {y}, large weak sets W remain possible. We thus must take a different tack, targeting new coverage desiderata. An aside: regression. Our development applies to regression or other problems with continuous or infinite response sets, e.g., Y = R, as nothing in Theorem 3 or Theorem 9 to come requires Y to be any particular space. We leverage this in our experiments (Sec. 5 and Appendix C) to give numerical examples, touching on the R-valued case here to demonstrate the analogues of our theoretical results. As an example, weak sets W in the continuous case may be intervals, arising, for example, from measurements with limited resolution. We adapt Corollary 4 to regression by replacing counting measure with the Lebesgue measure Leb, where the response set Y = R and the weak sets W R. Assuming that the weak sets all have a minimal volume, any valid confidence set mapping necessarily is large (on average) as well: Corollary 6 Suppose that b Cn : X Y = R provides (1 α)-strong distribution free coverage, and let L > 0. If P(Leb(Wi) L) = 1, for i = 1, . . . , n + 1, then E h Leb b Cn(Xn+1) i L(1 α). This follows because any measure on Y gives an analogous result: Corollary 7 Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage, let L > 0, and µ a measure on Y. If P(µ(Wi) L) = 1 for i = 1, . . . , n + 1, then E h µ b Cn(Xn+1) i L(1 α). Proof By Fubini s theorem and Theorem 3, E h µ b Cn(Xn+1) i = E Z Y 1 n y b Cn(Xn+1) o dµ(y) Y p(Xn+1, y)dµ(y) E inf y Wn+1 p(Xn+1, y)µ(Wn+1) L(1 α), as claimed. The extension of Corollary 5 follows from Corollaries 6 and 7, implying Corollary 5 as a special case. (See Appendix B.1.2 for the proof.) Corollary 8 Let µ be any measure on Y such that µ(W) > 0 with probability 1. Suppose that b Cn : X Y provides (1 α)-strong distribution free coverage and C : X Y is its limiting mapping, i.e., limn E[µ( b Cn(X) C(X))] = 0. Then P(W Det C(X) C(X)) 1 α, and in particular, if C is deterministic, then P(W C(X)) 1 α. Predictive Inference with Weak Supervision 2.2 A general weak-conformal scheme via scoring functions The theoretical limitations we identify motivate the weak coverage (3) we target instead of the strong converage (1). Following our discussion above, the new coverage definition stems from two desiderata: if the problem is actually low-noise and there already exists a highly predictive model we can leverage to build our confidence sets roughly, that conditional on x, a single label y belongs to the weak sets W with high probability and a model exists that can predict this y then while we should return this singleton even if we cannot guarantee strong coverage. In the alternative perspective that we care only about the value ϕ(Y ) recall the weak set (2) providing any y satisfying ϕ(y) = Y weak should suffice for prediction. We turn now to provide our general weak conformalization scheme. Our starting point is via the typical output of a machine-learned model, a scoring function s : X Y R that ranks potential labels (or responses) y for an input example x X. We treat s(x, y) is a non-conformity score, meaning the model predicts that values of y for which s(x, y) is small are more likely. Standard examples of such scoring functions include s(x, y) := |y bµ(x)| in regression, where bµ : X R predicts y | x; or s(x, y) := log py(x) in multiclass classification, where py(x) models the conditional probability of y | x. Throughout this section, we adopt a split-conformal perspective (Vovk et al., 2005; Barber et al., 2021), assuming the practitioner provides a scoring function independent of the sample (Xi, ϕi, Y weak i )n i=1 (the sample would typically be a validation set), and we show how to transform any such scoring function into a valid weakly-covering confidence mapping. Algorithm 1 Partially supervised conformalization Input: sample {(Xi, Y weak i , ϕi)}n i=1; score function s : X Y R independent of the sample; desired coverage 1 α (0, 1) For each i [n], compute Si := min y: ϕi(y)=Y weak i s(Xi, y). (5) Set btn := (1 + n 1)(1 α)-quantile of {Si}n i=1. Return: predictive set mapping b Cn : X Y defined by b Cn(x) := y Y | s(x, y) btn . Algorithm 1 starts from a simple observation, assuming that the scoring function s is accurate, so that likely y achieve small scores s(x, y). Given a query function ϕ and weak label Y weak equivalently, the weak set W = {y | ϕ(y) = Y weak} the most likely label should typically be the y minimizing s(x, y) over all y satisfying ϕ(y) = Y weak. An equivalent scheme to the scores (5) with label mappings ϕ and weak labels Y weak i uses weak sets Wi, where we replace the scores (5) with Si := min y Wi s(Xi, y). Cauchois, Gupta, Ali, and Duchi As a concrete example, for R-valued predictions with s(x, y) = |y bµ(x)| and interval weak sets Wi = [li, ui], we have Si = [li bµ(Xi)]+ + [bµ(Xi) ui]+. In either case, Algorithm 1 achieves valid weak coverage (3): Theorem 9 Let (Xi, Yi, ϕi)n+1 i=1 iid P and Y weak i = ϕi(Yi) for i [n + 1]. Then Algorithm 1 returns a confidence set mapping satisfying P h There exists y b Cn(Xn+1) s.t ϕn+1(y) = ϕn+1(Yn+1) = Y weak n+1 i 1 α. Proof Let Si := minϕi(y)=Y weak i s(Xi, y) for each i [n + 1]. By definition of b Cn, we have n y b Cn(Xn+1) | ϕn+1(y) = Y weak n+1 o = n y Y | ϕn+1(y) = Y weak n+1 and s(Xn+1, y) btn o , which is nonempty if and only if Sn+1 := min ϕn+1(y)=Y weak n+1 s(Xn+1, y) btn. As {Si}n+1 i=1 are i.i.d., this occurs with probability at least 1 α (e.g. Tibshirani et al., 2019, Lemma 1). 3 Constructing effective conformal prediction sets Algorithm 1 provides a generic method for conformalization in the presence of partially supervised data, and it makes no assumptions on the input score function s. Though the coverage guarantee (3) holds regardless, we can delineate a few additional desiderata that the predictive sets and score functions s should satisfy to make them more practically useful, which is our focus in this section: The score function s must allow the practitioner to efficiently carry out the computation of the partial infimum scores (5). The lower level sets b Cn(x) = {y Y | s(x, y) btn} should be efficiently representable. The confidence sets b Cn(x) should be small, as smaller confidence sets (for a fixed confidence level α) carry more information. Deferring our discussion of computational efficiency to Section 4, in this section we only focus on the last desideratum, implicitly assuming computation is tractable (for example, that Y is small). We first (Sec. 3.1) develop conditions sufficient for optimally-sized confidence sets to even exist a few subtleties arise before giving greedy algorithms for confidence set-size minimization, describing their properties, providing a few optimality guarantees in Section 3.2, and connecting to submodular minimization in Appendix A. Predictive Inference with Weak Supervision 3.1 Size-optimal scoring mechanism As in standard approaches to conformal inference (Lei, 2014; Lei and Wasserman, 2014; Barber et al., 2021), we aim to construct a confidence set mapping b Cn with minimal average size over X PX. Our starting point is simply to define size-optimality, where to achieve exact coverage and size guarantees, we allow randomization of our confidence sets via an independent variable U Uni[0, 1]. Definition 10 A randomized confidence set mapping C1 α : X [0, 1] Y is marginally size-optimal at level α if it solves minimize C:X [0,1] Y EX,U Uni[0,1] [|C(X, U)|] subject to P(W C(X, U) = ) 1 α. (Marg) It is conditionally size-optimal at level α if for almost every x X, C(x, ) solves minimize C:[0,1] Y EU Uni[0,1] [|C(U)|] s.t. P(W C(U) = | X = x) 1 α . (Cond) Even with full knowledge of the distribution P, techniques for finding marginally sizeoptimal confidence sets (Marg) are not immediately apparent; as a consequence, we focus on the conditional case first. Even in this case, it is in general non-trivial to obtain smallest confidence sets. Yet as we follow the standard practice in conformal prediction of defining confidence sets via the scores s (recall Alg. 1) as Ct(x) = {y | s(x, y) t}, our confidence sets have the natural nesting property that Ct(x) Ct (x) whenever t < t . Abstracting away the particular form of C to enable a purely set-based focus, we thus consider nested confidence sets, where we show that optimality guarantees are possible. Definition 11 A collection of mappings {Cη : X [0, 1] Y}η (0,1) is nested if P(Cη1(X, U) Cη2(X, U)) = 1 for all 0 < η1 < η2 < 1. There is an immediate equivalence between score-based conformalization schemes and nested collections of confidence mappings (Gupta et al., 2022): we simply define snest(x, y, u) := inf {η (0, 1) | y Cη(x, u)} . (6) The next lemma formalizes this equivalence (see Appendix B.2.1 for a proof). Lemma 12 Assume the confidence set mappings {Cη}η (0,1) are nested and snest(x, y, U) has continuous distribution for U Uni[0, 1]. Then Cη(x, U) = y Y | snest(x, y, U) η with U-probability 1. That is, obtaining weak coverage for nested confidence mappings is equivalent to obtaining weak coverage using the scoring function snest, which Alg. 1 provides; that is, it is equivalent to choosing the smallest η (0, 1) such that P(W Cη(X, U) = ) 1 α. A second useful distributional property of the nested scores (6) is that, assuming the confidence sets Cη are conditionally valid, we can provide strong distributional results on snest. To make this Cauchois, Gupta, Ali, and Duchi precise, we say that Cη is conditionally valid for the weak labels W if for each η (0, 1) and with P-probability 1 over X, P (Cη(x, U) W = | X = x) = η. (7) We then have the following uniformity property as an immediate consequence of Lemma 12: Lemma 13 In addition to the conditions of Lemma 12, assume that Cη is conditionally valid (7) for the weak label W. Then the minimum score (5) is independent of X and satisfies inf y W snest(x, y, U) Uni[0, 1]. Proof With U-probability 1, infy W snest(x, y, U) η if and only if Cη(x, U) W = , and so P(infy W snest(x, y, U) η | X = x) = P(W Cη(x, U) = | X = x) = η. Einbinder et al. (2022, Prop. 1) gives a similar result to Lemma 13, where the conformity scores they introduce also induce prediction sets satisfying the nested property. To illustrate this lemma, suppose that there exist nested conditionally size-optimal mappings {Ccond η }η (0,1) solving problem (Cond): in that case, they satisfy the conditions for application of Lemma 13 so that the induced scores Si are uniform; Alg. 1 will thus compute btn = (1 α)+OP (n 1/2) as btn is the (1 α) quantile of Si iid Uni[0, 1]. So in the case that we have (near) conditional coverage Alg. 1 maintains it. Notably, given a score function s, not necessarily the nested score (6), but strong in the sense that it models (X, Y ) well enough that for each α, we can choose t so that P(s(x, Y ) t | X = x) = α, then the confidence sets Algorithm 1 returns are indeed nested, and Lemma 13 applies to the induced nested score snest. Optimal nested sets need not always exist (see Example 3 below), but we can provide natural conditions on the distribution of W | X = x sufficient to allow such nested coverage, which we do in the next subsection. 3.1.1 From conditionally to marginally valid confidence sets Our initial criterion (Marg) is purely marginal: we wish to compute a marginally sizeoptimal confidence set. Conveniently, conditionally size-optimal mappings can yield marginally size-optimal problems. In particular, assume that the mappings {Ccond η }η (0,1) are conditionally size-optimal (Cond) and satisfy P(W Ccond(x, U) = | X = x) η. The following proposition shows how to transform these into marginally size-optimal confidence sets. Proposition 14 Let the mappings {Ccond η } be conditionally size-optimal (Cond) as above, and define the average size(x, η) := EU[|Ccond η (x, U)|]. Let smarg be any minimizer of E [size(X, s(X))] s.t. E[s(X)] 1 α over s : X [0, 1]. Then a solution to the initial marginal problem (Marg) is Cmarg 1 α (x, u) := Ccond smarg(x)(x, u). Predictive Inference with Weak Supervision More directly, any conditionally size-optimal sets which are at least easier to characterize as they need only randomize over U Uni[0, 1] yield marginally size-optimal confidence sets in a relatively straightforward way: one chooses the probability of miscoverage, s(x), minimizing the expected confidence set size. Proof That CMarg 1 α provides valid 1 α coverage is nearly immediate: by conditional size optimality, we have P(W Cmarg 1 α (X, U) = ) = E[tmarg(X)] 1 α. Let C be any confidence set mapping such that P(W C(X, U) = ) 1 α, and define s C(x) := P(W C(x, U) = | X = x) [0, 1], which satisfies E[s C(X)] 1 α. By assumption on Ccond, for each fixed x X, the set Ccond s C(x)(x, U) is size-optimal (Cond) at level s C(x), so that for PX-almost every x X, we have size(x, s C(x)) = EU Uni[0,1] h |Ccond s C(x)(x, U))| i EU Uni[0,1] [|C(x, U)|] . Integrating both sides of the inequality over X PX, and using the assumed optimality condition on smarg, we obtain E [size(X, smarg(X))] E [size(X, s C(X))] EX,U [|C(X, U))|] . The left-hand size is the average size of Cmargin 1 α . 3.2 Greedy algorithms for confidence set-size minimization Given the distribution or a model of the distribution of the weak set W conditional on x, we propose a natural greedy algorithm to construct a confidence set satisfying the weak coverage constraint: at each step, Algorithm 2 adds the label that increases coverage the most until the confidence set achieves a desired level. Algorithm 2 draws inspiration from Romano et al. s Algorithm 1 2020, where the authors formulated conformal inference methods tailored for categorical and unordered response labels. These methods not only ensure valid marginal coverage but also afford approximate conditional coverage. As we show presently, there are natural families of distributions where this greedy algorithm is optimal; however, there are failure modes, of which we also provide an example. In Appendix A, we relate this greedy construction to submodular optimization to provide general guarantees of confidence set size and coverage. Alg. 2 returns a nested sequence {Cgr η (x, U)}η (0,1), where U Uni[0, 1] randomizes to achieve an appropriate level. While the sequence need not necessarily solve problem (Cond) (see Example 3 to come), there are natural sufficient conditions for Algorithm 2 to return a size-optimal set, of which we present two. As the first particular case, consider that conditional on x, labels y Y belong to W independently: Definition 15 A probability distribution P on W 2Y has label-independent structure if {1{y W}}y Y are independent random variables when W P. We might expect W to exhibit label independence when all labels y Y satisfy π(y | x) 1, with the exception of a single label y (x), for which π(y (x) | x) 1, as will often be the case in low-noise classification settings. Cauchois, Gupta, Ali, and Duchi Algorithm 2 Greedy weakly supervised scoring mechanism Input: model for the distribution of W given X = x; coverage rate η (0, 1) for each j [K] define recursively yj(x) := argmax y Y P i=1 {yi(x) W} | X = x for each j [K] define Cgr,j(x) := {yi(x) | i j} and set j(x, η) := min j [K] | P W Cgr,j(x) = | X = x η . tη(x) := η P(Cgr,j(η,x) 1(x, U) W = | X = x) P(Cgr,j(x,η)(x, U) W = | X = x) P(Cgr,j(x,η) 1(x, U) W = | X = x). return function Cgr η : X [0, 1] Y defined by Cgr η (x, u) := ( Cgr,j(x,η)(x) if u < tη(x), Cgr,j(x,η) 1(x) otherwise. Figure 1. A tree-structured (8) distribution for W given X = x, with Y = {1, 2, 3, 4}. The possible configurations for W are the singletons {y1}, {y2}, {y3}, {y4}, the two pairs W1 = {y1, y3} and W2 = {y2, y5}, W0 = {y1, y2, y3, y5}, and Y itself. Another scenario occurs when the label space exhibits a hierarchical tree structure, as one may expect in image classification (Deng et al., 2009) or structured prediction tasks (Cabannes et al., 2020). When the weak sets W obey the same structure as the distribution they are subtrees of the global tree we say the labels have a tree structure (see Figure 1): Definition 16 A probability distribution P on W 2Y has a tree structure if for all w1, w2 Y, P(W = w1) > 0 and P(W = w2) > 0 imply w1 w2 {w1, w2, }. (8) Both definitions (independent labels and hierarchically-structured weak labels) are sufficient to guarantee size-optimality for the greedy confidence sets Algorithm 2 constructs. The next Proposition, whose proof we provide in Appendix B.2.2, makes this formal. Predictive Inference with Weak Supervision Proposition 17 Suppose the probability law L(W | X = x) has either label-independent structure (Def. 15) or a tree structure (Def. 16). Then for all η (0, 1), Cgr η is conditionally size-optimal, and therefore is a minimizer in equation (Cond). In general, even with perfect knowledge of the distribution of W | X = x, the nested greedy confidence sets Cgr η need not be size-optimal, as there may be weak sets appearing with high probability while their constituents do not, so that the conditionally size-optimal sets {Ccond η }η [0,1] are not nested. The next example illustrates one such failure mode: Example 3: Let the distribution of W be {1, 2} w.p. 0.3 {1, 3} w.p. 0.25 {2} w.p. 0.2 {3} w.p. 0.15 {1} w.p. 0.1. Then for η = 0.9, it is immediate that Ccond η (x, u) = {2, 3}, but Cgr η (x, u) = {1, 2, 3} or {1, 2} depending on whether u < 1/3. In addition, Ccond eta (x, u) = {1, 3} when η = 0.85, showing that in this case, the confidence set mappings Ccond η need not be nested. 3 In Appendix A, we include a few ancillary results showing that even in general cases like Example 3, the sizes of the confidence sets Cgr and Ccond cannot be too far apart. 4 Efficient conformalization for large output spaces While Section 3 provides a generic treatment on for producing scoring functions and associated confidence sets of minimal size, in typical practice, a (pre-trained) model provides a predictive scoring function, which may not be directly associated to a probability metric, and we wish to leverage such models. This is of particular interest when the label space Y is large, as in structured prediction problems (Taskar, 2005; Cabannes et al., 2020), where computational efficiency becomes a main challenge. In this section, we thus first introduce a general method for computing and representing confidence set mappings of the form {y | s(x, y) ˆtn}, and then describe how to efficiently carry out Alg. 1 in ranking problems (Section 4.2); for interested readers, we include constructions for matching problems in Appendix C.1 problems. 4.1 Conformal confidence sets with sequential partitioning We seek to efficiently compute and represent the confidence set b Cn(x) for any instance x X, typically for a task where the label space contains more configurations than are efficiently enumerable (K! for matching and ranking problems over K items). At the same time, recalling that btn denotes the threshold Algorithm 1, if our confidence sets are to be informative they should include relatively few configurations y Y satisfying s(x, y) btn. To the end of computing the set b Cn(x) = {y | s(x, y) btn} in Alg. 1, we focus on methods for computing a given number M of configurations with the smallest score s(x, y). This is essentially without loss of generality: while we may not know the appropriate M = Mx = Cauchois, Gupta, Ali, and Duchi Partition Update y3 1,2 y3 3,2 y3 1,2 y3 3,2 y3 y3 2,2 = y4 y4 1,2 y4 3,2 Figure 2. Alg. 3 scheme for sequential partitioning: first, partition the subset containing the m + 1-th best configuration, y3 2,2 in this case, then compute both second-best configurations in the newly formed subsets of the partition here Y4 2 and Y4 4. | b Cn(x)| to guarantee coverage, if for each M N we can find the M best configurations in time polynomial in M, then by sequentially doubling M until we obtain an element y Y such that s(x, y) > btn, we achieve time polynomial in Mx. Algorithm 3 builds on this intuition to return a valid confidence set. The Algorithm we suggest is essentially an extension of the algorithm proposed by Chegireddy and Hamacher (1987) to find the Kbest matchings in a bipartite graph. We reuse the general idea (i.e. compute a sequence of partitions of the space and maintain a list of the two best configurations for each item of the partition) and extend it to a more general structured prediction. The key is to observe that we only need to be able to compute the two best configurations of a given subset of configurations, which they do on matching problems (and our Algorithm essentially reduces to theirs in the matching case). We then apply that paradigm to the ranking case. We remark briefly that an alternative approach is to conformalize directly on the size M of the confidence set: suppose we learn a function c M : X N predictive of the rank (according to {s(x, y)}y Y) of the first compatible configuration, i.e predictive of Mi := rank of the first configuration y Y such that ϕi(y) = Y weak i . In that case, if we let b Qn := 1 + n 1 (1 α) -quantile of {Mi c M(Xi)}n i=1, we would only need to return b Cn(x) := n c M(x) + b Qn best configurations y Y ordered by s(x, y) o . This approach makes prediction more efficient (as we know in advance the number of configurations to compute), but the computational effort of the conformalization step (5) increases, as we must compute the rank of the best constrained configuration for each instance. 4.1.1 Returning M best configurations with sequential partitioning Let us now fix M 1, and focus on retrieving the M configurations with the lowest scores. Algorithm 3 provides a general recipe using dynamic programming, and it is efficient as long as we can efficiently compute certain partitions of the label space. We require the following definition. Predictive Inference with Weak Supervision Definition 18 A function Partition : 2Y Y Y 2Y 2Y is valid for a score function s if, for every subset Y Y and pair of configurations y1, y2 Y satisfying y1 argmin y Y s(x, y) and y2 argmin y Y\{y1} s(x, y), Partition( Y, y1, y2) returns a partition ( Y1, Y2) of Y such that y1 Y1 and y2 Y2. We thus leverage two conditions: a valid Partition for our score function s and, for each pair of subsets Y1, Y2 Y that it produces, we must be able to (efficiently) compute the second-best configurations in Y1 and Y2, i.e., y1,2 argmin y Y1\{y1} s(x, y) and y2,2 argmin y Y2\{y2} s(x, y). Figure 2 encapsulates the main idea Alg. 3: at each step m [M], we maintain a partition {Ym j }m j=1 of Y such that if ym j argminy Ym j s(x, y), then for all j [m], we have ym j argmin y Y\{ym 1 ,...,ym j 1} s(x, y), i.e., ym j is the j-th best configuration in Y. Now, for each j [m], let the configuration ym j,2 argminy Ym j \{ym j } s(x, y) be the second-best configuration in Ym j . The key is then to observe that if we set ind(m) := argmin j [m] s(x, ym j,2), then ym ind(m),2 is the (m+1)st best configuration in Y. The Partition function then divides Ym ind(m) into two sets Ym+1 ind(m) and Ym+1 m+1 such that ym ind(m) Ym+1 ind(m) and ym ind(m),2 Ym+1 m+1. Under the assumption that Partition is valid (Def. 18) for the score s, the following lemma guarantees the validity of Algorithm 3. Lemma 19 Assume the Partition function is valid for the score function s. Then Algorithm 3 returns a set of configurations {yj}M j=1 such that for each j [M], yj argmin y Y\{y1,...,yj 1} s(x, y). Proof This follows by an induction over m 1, which guarantees that at every step m 1, {Ym j } is a partition of Y such that ym j = argminy Ym j and s(x, ym 1 ) s(x, ym 2 ) s(x, ym m) min y Y\{ym j } s(x, y). The property transitions from m to m+1 as the Partition function is valid, and we choose ym+1 m+1 as the best second-best configuration, hence it is the (m + 1)st best configuration. The existence of an efficient valid Partition function is instance-dependent and typically requires a specific choice of scoring function; we provide concrete implementations for two types of structured prediction problems. Cauchois, Gupta, Ali, and Duchi Algorithm 3 Sequential partitioning Require: score function s : X Y R; valid (Def. 18) Partition: 2Y Y Y 2Y 2Y; instance x X initialize: Compute y1 1 := argminy Y s(x, y) and y1 1,2 := argminy Y\{y1 1} s(x, y). Set Ym 1 := Y {Initialize the partition} for m = 1, 2, . . . , M 1 do ind(m) := argminj [m] s(x, ym j,2) {Find the m + 1-th best configuration} ym+1 m+1 := ym ind(m),2 for j [m] \ {ind(m)} do (Ym+1 j , ym+1 j , ym+1 j,2 ) := (Ym j , ym j , ym j,2) {All subsets {Ym j }j =ind(m) remain identical} end for Ym+1 ind(m), Ym+1 m+1 := Partition(Ym ind(m), ym ind(m), ym+1 m+1) {Partition the set Ym ind(m)} ym+1 ind(m) := ym ind(m) and ym+1 ind(m),2 := argminy Ym+1 ind(m)\{ym+1 ind(m)} s(x, y) ym+1 m+1,2 := argminy Ym+1 m+1\{ym+1 m+1} s(x, y) {Compute second-best configurations} end for return {y M m }M m=1 4.2 Structured prediction examples (Ranking problems and partial labeling mechanisms) While Algorithm 3 is generic, we now show that efficient partitioning and minimization functions exist in structured prediction instances, so that we may efficiently carry out above algorithms in the instance. Here we focus on ranking problems and defer the discussion on matching tasks in Appendix C.1. The goal here is to predict a preference ranking y Y = SK of K different items, documents, for a certain user or query x X, where y(i) denotes the item of rank i. Typically, one achieves this by learning relevance functions rk : X R, which evaluate each item 1 k K individually before aggregating into a single ranking prediction (Freund et al., 2003; Duchi et al., 2013; Qin and Liu, 2013; Cao et al., 2007). We assume here that we have access to such relevance functions. In ranking tasks, there are two reasonable ways in which practitioners may acquire partial supervision or user feedback. The first mechanism (Cabannes et al., 2020) assumes they only receive a subset of all K 2 pairwise comparisons 1 y 1(i) < y 1(j) 1 i 0 when a < b, non-increasing in the first argument and non-decreasing in the second. Unless we specify otherwise we use ψ(a, b) = [b a]+ in our experiments. Finding the configuration y that minimizes the partial score (5) of a ranking-consistent score function is straightforward: it suffices to rank all the elements in [K] \ {y(i)}Kpartial i=1 according to their relevance scores (rj(x))K j=1, and then append them to the first Kpartial elements. This property allows efficiently retrieving the M 1 best configurations with Alg. 3. Throughout the loop, we make sure that any set of permutations Ym j is a subset of permutations consistent with a finite number of partial rankings (pairwise comparisons), and that its best and second-best configurations ym j and ym j,2 only differ by a neighboring transposition of the form (i + 1, i), satisfying ym j,2 := argmin y Ym j {s(x, y) | i [K], y = (i + 1, i) ym j }. (11) If we can guarantee this loop invariant, then there always exists ij,m [K] such that ym j,2 = (ij,m + 1, ij,m) ym j , and we only need to define the partition function on a smaller subset of 2Y Y Y: for any subset of permutations Y Y, y Y and i [K] such that (i + 1, i) y Y, we let Partition Ranking( Y, y, (i + 1, i) y) := Y y Y | y 1( y(i)) < y 1( y(i + 1)) , Y y Y | y 1( y(i)) < y 1( y(i + 1)) , (12) splitting Y according to whether y(i) has a higher rank than y(i + 1). The next lemma, whose proof is in Appendix B.3.1, states that this partition rule indeed guarantees that, at every step m of the loop in Algorithm 3, the second-best configuration in Ym j satisfies the invariant (11). Lemma 21 Assume the score function is ranking-consistent (9) for a set of relevance functions {rk}K k=1. Then Algorithm 3 with the Partition Ranking function (12) produces a sequence of partitions with second-best configurations satisfying equation (11). That is, Algorithm 3 is correct. Cauchois, Gupta, Ali, and Duchi 5 Experiments In this section, we test our weakly supervised methods experimentally, in different classification and regression problems, on both synthetic and real datasets, with an emphasis on their computational efficiency and informativeness. Here we focus on ranking problems and present further experiments on matching and regression problems in Appendix C. The primary goal of this paper is not to provide end-to-end models with only partially supervised data, but rather to introduce a new form of coverage validity and show how to achieve it with partially labeled data. In contrast to the split-conformal method (Vovk et al., 2005), which requires fully supervised instances for both training and validating, we only need these to train a model and form a scoring function suitable for the application of Alg. 1. In some cases, standard models already exist, such as in image classification (He et al., 2016). To provide a meaningful comparison with existing conformal methods and test for predictive set size efficiency, we use fully labeled real datasets, and introduce different plausible forms of weak supervision on our calibration and test sets before applying Algorithm 1 to construct confidence sets. Our method displays similar behavior across all datasets and forms of partial information. To provide a baseline, we also run a standard fully supervised conformal scheme (FSC) using the strong labels Yi and true scores s(Xi, Yi), which runs similarly as Alg. 1, but with threshold btfull n := (1 + n 1)(1 α)-quantile of {s(Xi, Yi)}n i=1. (13) We can then estimate the gain in efficiency in the form of decreased confidence set sizes that stems from the weakening of strong coverage (1) to weak coverage (3). 5.1 A toy classification example We first perform an experiment with a toy multiclass data set containing K = 10 different classes and d = 2 dimensional features. We consider a partially supervised problem on X Y = Rd [K] for which we wish to output valid confidence sets. We use the following model: each potential response y [K] has a noisy score depending on the feature vector X Rd though a vector θ y Rd, {Soracle y }y [K] | X = x N({x T θ y}y [K], σ2IK) (14) Ideally, we would recover the strong label Y := argminy Y Soracle y , but our weakly supervised methods do not observe Y directly: instead, for a random instance-dependent threshold T, we only have access to the weak set W := {y Y | Soracle y T}. As motivation, consider a supervised learning task in which, out of all potential responses, there is always only one ground truth, but there are other labels that are good enough (i.e. have a low enough score) to answer a certain query. In this setting, a confidence set is weakly valid (3) as long as it contains at least one label y such that Soracle y T, whereas it is strongly valid (1) if it contains Y . We vary the signal-to-noise ratio σ 1 {10 2, . . . , 102}: when it is too small, no model (even an oracle one) can be highly predictive, and a standard conformal method should Predictive Inference with Weak Supervision Strong Weak 0.03 0.1 0.3 1 3 10 30 100 0.03 0.1 0.3 1 3 10 30 100 factor(SNR) 0.1 1.0 10.0 100.0 SNR Average Set Size σ 1 σ 1 σ 1 E h ˆC(X) i P(Y b C(X)) P(W b C(X) = ) Figure 3. Results for the simulated multiclass data (14), over Ntrials = 20 trials. The left plot shows respectively the strong (1) and weak (3) coverage for the greedy weakly supervised (GWS), the weakly supervised conformal (WSC) and the fully supervised conformal (FSC) confidence sets. The right plot displays the average confidence set size for these methods. provide large uninformative confidence sets, whereas we expect our new definition of coverage to yield smaller sets, as any label in W (i.e. with a low enough score) provides valid coverage. In this experiment, we compare three different methods. The Greedy weakly supervised (GWS) method only uses partially labeled data both when training and conformalizating. It first trains K separate logistic regressions with {Xi} as features and each {1{y Wi}} for all y Y as potential response, providing a model for P(y W | X = x), and models the distribution of W given X = x as label-independent (see Definition 15). It then computes a nested sequence of confidence sets thanks to Alg. 2, which we then feed to the conformalization Algorithm 1 using the nested scoring mechanism (6). The second and third methods, the Weakly supervised conformal (WSC) and the Full supervised conformal (FSC) methods respectively, use fully supervised data for training: we first train a standard logistic regression model pθ(y | x) exp(θT y x) on {(Xi, Yi)}, and then construct a scoring function using the Generalized Inverse Quantile (GIQ) procedure that Romano et al. (2020) introduce. In the conformalization step, the WSC method runs Algorithm 2 with partially labeled calibration data, while the FSC method uses strongly labeled data to compute the threshold btfull n in (13). The threshold btn in Alg. 2 is always smaller than btfull n , so the FSC method returns larger confidence sets than the WSC method. We expect that as the signal to noise ratio decreases, the gap between the GWS and WSC confidence sets and the FSC confidence sets increases. Cauchois, Gupta, Ali, and Duchi The precise experimental set-up is as follows: we simulate n = 104 data points, splitting them into training (30%), calibration (20%) and test (50%) sets. We draw each θy uniform on Sd 1, {Xi}n i=1 iid N(0, Id), choosing weak threshold T Uni[miny Y{SOracle y }, maxy Y{SOracle y }]. We repeat the entire process Ntrials = 20 times to account for uncertainty, presenting our results in Figure 3. As we expect, using an alternative weaker version of coverage (3) allows us to significantly decrease the size of the confidence set (by up to a factor of 3), especially when the signal-to-noise ratio is small, as one must include more classes in the confidence set to maintain strong coverage. Indeed, we can see that the strong coverage (1) for the GWS and WSC procedures fall well below 1 α = 95% in this case, since they only strive for weak 1 α coverage, which they consistently achieve. Since the GWS method aims to construct minimal confidence sets, we expect that it produces smaller confidence sets than the WSC method, which simply leverages an existing strongly supervised model; we consistently observe this across different values of σ. 5.2 Document ordering for query answering We now present the results of two experiments using Alg. 1 in a ranking problem. The first simulates a standard ranking task, while the second focuses on ranking documents relevance to specific queries in the Microsoft LETOR dataset (Qin and Liu, 2013). 5.2.1 Ranking simulation study In a first simulation study, we aim to predict a ranking of labels y [K] based on a feature vector X Rd. Think here of a supervised problem where we want to rank users preferences for a set of items. Each user has an unknown relevance score SOracle y R for each item y [K], which induces a ground truth ranking over the labels: Y := argsort{SOracle y }y [K] SK. The problem is to recover this noisy ranking and produce valid confidence sets in SK, but our weakly supervised methods do not observe the full ranking when conformalizing: they can only observe the ranking up to the Kpartial K-th element, leading to the weak set W = {y SK | j [Kpartial], y(j) = Y (j)}. (15) In our experiment, we simulate n = 104 i.i.d. different users, using the same (30,20,50) train/validation/test split as in Section 5.1. With K = 7 and d = 2, we draw the user feature vector Xi iid N(0, Id), and then conditionally on Xi, we produce normal itemwise relevance scores {SOracle iy }i [n],y [K] following the distribution (14). We finally simulate partial supervision by drawing the number of observed elements in the ranking Kpartial i as min(K, 1 + Ai), where Ai iid Poi(.5). The lower left panel of Figure 4 shows the overall distribution of this quantity: most users only reveal the first 1 to 3 items in their optimal ranking. We then produce strongly and weakly valid confidence sets at the 1 α := 90% level. We use the same scoring model for both the fully supervised conformal (FSC) and weakly Predictive Inference with Weak Supervision Strong Weak 0.03 0.1 0.3 1 3 10 30 100 0.03 0.1 0.3 1 3 10 30 100 0.4 factor(SNR) 1 2 3 4 5 6 7 size 1e 01 1e+00 1e+01 1e+02 1e+03 Size CDF P(|C(X)| <= Size) P(Kpartial i = t) Size t Size t P | b C(X)| t) P(Y b C(X)) P(W b C(X) = ) A Figure 4. Results for the ranking simulation study (15) over Ntrials = 20 trials. A: Strong (1) and Weak (3) coverage for the weakly supervised conformal (WSC) and the fully supervised conformal (FSC) confidence sets. B: Density histogram of the variable Kpartial (15) governing the weak distribution in this example. C: Distribution of the confidence set size | b C(X)| for different signal-to-noise ratios σ 1. Cauchois, Gupta, Ali, and Duchi supervised conformal (WSC) procedures: we learn linear individual relevance score functions {ry}y [K] (with fully supervised training data) via the List Net procedure (Cao et al., 2007), which we briefly describe here. Given a set of relevance scores {ry}y Y RK, List Net models the probability of a ranking π SK as exp(rπ(y)) Pk l=y exp(rπ(l)) , (16) which gives each item y Y a top-1 probability (of ranking first) equal to P 1 r (y) := Pr(π(1) = y) = exp(ry) Pk l=1 exp(rl) . Given a training data set containing pairs (X, R) X RK of features/relevance scores, we learn score mappings by minimizing the log-loss of the top-1 distribution over a set F of functions {ˆry}y [K] := argmin r FY (X,R) training data k=1 P 1 R(k) log P 1 r(X)(k) In our experiment, we only observe the ranking (or even a fraction of), not the true peritem relevance scores, hence, following common practice (Cao et al., 2007), we use Ry = K the rank of the item = K Y 1(y) as a proxy for our observed item-wise relevance scores when training our model. In our experimental set up, each relevance score function ry : X R ideally estimates the true conditional mean of the oracle scores, x 7 x T θ y. Given these individual scores, we use the scoring mechanism (10) with ψ(x, y) := (y x)+ and conformalize using the strategy we describe in Section 4.2. The difference between the WSC and FSC methods is the conformalization step on calibration data: WSC runs Alg. 1 with partially supervised data to compute the score threshold btn while FSC uses strongly supervised data to return the more conservative threshold btfull n in (13). Our results fit our initial expectations, in line with our first experiments: the size of the confidence set, as Figure 4C shows, benefits from the weaker definition of coverage: for any value of the signal to noise ratio σ 1 > 0, the WSC method produces much smaller and more informative confidence sets than the FSC method, as it only needs to include a ranking with the correct first Kpartial elements to provide valid coverage. ,At the cost of the strong coverage falling below 1 α (see Fig. 4A), and with little information (only the first Kpartial < K labels), the WSC method constructs predictive sets that are much smaller and yet still valid (in a weak sense). 5.2.2 Ranking experiment with Microsoft LETOR dataset We now tackle a slightly different type of ranking problem: we wish to rank a set of potential documents by order of relevance to a specific user query: documents more relevant to the query should occupy a higher position in the final ranking. A search engine is a good example of such problem: a user makes a search query, and the task is to sort Web pages that best answer that query among a (potentially large) set of potential pages. Predictive Inference with Weak Supervision Strong Weak 2 4 5 8 10 15 0 2 5 8 0 2 5 8 factor(Score.Weight) 2 4 5 8 10 15 0 25 50 75 100 CDF P(|C(X)| <= Size) P | b C(X)| M t) P(Y b C(X)) P(W b C(X) = ) K = 2 K = 4 K = 6 K = 8 K = 10 K = 15 K = 2 K = 4 K = 6 K = 8 K = 10 K = 15 Figure 5. Results for LETOR ranking dataset (Qin and Liu, 2013) over Ntrials = 20 trials. Each plot represents a different value of K {2, 4, 8, 10, 15}, the number of documents to rank, and we compare different scoring functions by varying the value of c {0, 2, 5, 8} in equation (18). A: Strong (1) and Weak (3) coverage for the weakly supervised conformal (WSC) and the fully supervised conformal (FSC) confidence sets. B: Distribution of the confidence set size | b C(X)| for different numbers K of suggested documents. We display here the distribution of min(| b C(X)|, M) for M = 100. Cauchois, Gupta, Ali, and Duchi Learning to rank with Microsoft LETOR dataset (Qin and Liu, 2013) To study that problem, we design an experiment with Microsoft LETOR data set. For each potential query/document pair (x, d), the dataset aggregates several quantities of interest to determine whether d is relevant to x into a d = 46-dimensional feature vector φ(x, d) Rd. For a query x with a set of potentially relevant documents D(x) := {dj}|D(x)| j=1 , our data set additionally contains a ranking Π(x) S|D(x)| that orders these documents according to their relevance. Our goal is to retrieve that ranking using the feature vectors {φ(x, dj)}D(x) j=1 . A semi-synthetic weakly supervision set-up We construct weakly supervised calibration and test data sets as follows. For each split (calibration/test), we first sample n = 2000 queries from the entire set of queries in LETOR validation and test datasets. For every query Xi, we select K {2, 4, 6, 8, 10, 20} documents by first sorting D(Xi) into K equally sized subsets by relevance, so subset ℓ [K] contains the documents with rank Π(x)y for every y {(ℓ 1)|D(Xi)| K + 1, . . . , ℓ|D(Xi)| K }, and then drawing one document from each box uniformly at random. This procedure ensures that there exists a significant relevance gap between any two potential documents in the query, and that the number of documents to rank is sufficient to allow reasonably-sized confidence sets. Π(Xi) additionally induces a sub-ranking Yi SK on these documents, which we treat as a strong label. Similarly to our approach in Section 5.2.1, we introduce partial labels by assuming that our weakly supervised method can only access the first Kpartial i elements of Yi, where Kpartial i iid 1 + Poi(.5): this simulates the plausible setting where a user has given feedback on the most relevant documents to the query, but certainly not to all of them. We repeat the entire simulation procedure Ntrials = 20 times. Building a ranking scoring function (10) We next describe how we use fully supervised training data to construct the scoring function that we feed Alg. 1 with. We first learn a linear query/document relevance function rθ(x, d) := θT φ(x, d) (17) using the List Net procedure (16) on LETOR (fully supervised) train data xi, (di,j)D(xi) j=1 , yi SD(xi) ntrain containing 55700 different query/document pairs. We then use a specific implementation of the score function s Ranking as in Eqn (10): if we rank K documents {dk}k [K] for a query x, we rescale our relevance scores to the interval [0, 1], {rk(x)}k [K] := rθ(x, dk) minj [K] rθ(x, dj) maxj [K] rθ(x, dj) minj [K] rθ(x, dj) and then, for a choice of c {0, 2, 5, 8}, apply the scoring mechanism (10) with these relevance scores and pairwise comparison function ψc(r1, r2) := exp( cr1) (r2 r1)+ . (18) Predictive Inference with Weak Supervision In this example, we guarantee weak coverage if the true ranking Yi on the first Kpartial i elements coincides with either one of the predictive rankings. To keep the predictive set size small, we thus wish to ensure that it doesn t contain two rankings with the same first Kpartial i elements (as they would be redundant): this is why we introduce the exponential term exp( cr1), which makes sure that when ranking all configurations by their score, highly ranked configurations have different first elements (rather than different last elements). To estimate the distribution of | b C(X)|, we then compute the M = 100 best rankings for each query using Alg 3, and then replace the size of the confidence set by min(M, | b C(X)|)), effectively truncating it to M. Experimental results We present our results in Figure 5. The confidence sets display the behavior we expect: when the number K of items to classify is small, the fully supervised conformal (FSC) and weakly supervised conformal (WSC) methods are similar, since partial labels are often equal to strong labels. Since the overall number of configurations is small, both methods are also able to maintain fairly small confidence sets. On the other hand, when K grows, the weak supervision method quickly departs from the full supervision one, and is able to produce confidence sets that are much smaller: when K 8, the FSC method is unable to produce confidence sets with fewer than 100 configurations, as the number of configurations is large, and the problem is inherently noisy, especially for comparing documents with a fairly small relevance. The WSC (partially) overcomes that difficulty with its restrained notion of coverage, and is able to maintain a majority of confidence sets with size smaller than M, at least until K = 15. Of course, this method pays a price in terms of strong coverage, as for large K, the confidence set almost never contains the actual ground true ranking. That said, it may not a real issue as we are more interested in detecting which documents are actually relevant, and hence should have a higher rank, rather than correctly ordering documents with very little relevance to the query at the bottom of the list. In addition, as we predicted, higher values of c in the pairwise comparison function (18) produce much smaller confidence sets by favoring more diverse rankings at the top of the list. 6 Discussion The new measures of coverage we develop here tailored to partially supervised data that may be easier to collect in many engineering and measurement-centric scientific scenarios help to bridge a gap between typical conformal predictive inference methods, which require expensive supervised data, and problems with partial supervision, whose typical focus is on prediction but not uncertainty quantification. Our hope is for this paper to open several avenues for future work. First, Algorithm 1 does not currently quantify the amount of coverage it provides conditionally on the query function, which essentially means in an item ranking framework that we do not know ahead of time whether we guarantee the top 2 or top 10 elements of the ranking to be correct. This occurs first because the query function is unknown ahead of time, and second because coverage (3) is marginal over the full randomness of the sample. Similarly to conformal inference extension works bridging the gap between marginal and conditional coverage, or between marginal and label-wise Cauchois, Gupta, Ali, and Duchi coverage, one potential goal is to adapt these methods and even out coverage conditionally on (plausible) query functions. Our approach acts as a wrapper around any black box machine learning model, providing valid coverage guarantees independent of model quality. However, poorly trained models impact the efficiency of prediction sets, which can be large when training data is scarce. Thus, efforts to mitigate overfitting and train high-quality models are paramount to ensuring the efficiency of our method s prediction sets. Scalability, while generally manageable with our methods for large datasets, presents challenges primarily during the conformalization step. Recognizing the absence of a one-size-fits-all solution, we have tailored a few scalable and, we hope, exemplar methods that capture diverse applications. The new definition (3) is intrinsically a 0-1 loss-based approach, in the sense that the confidence set b Cn either covers the weakly supervised set or fails. A natural initial extension is thus similarly to what Bates et al. (2021) propose in the strongly supervised case, recognizing that many structured prediction problems (e.g., segmentation tasks or multilabel problems) benefit from more subtle and granular loss functions. In the same vein, we present a few efficient choices of scoring mechanisms for structured prediction, which highlight the practicality and potential application of our general methodology; it seems quite plausible that more sophisticated scoring models could yield substantial improvements. In our view, one of the more exciting potential applications of this work reposes on the (growing) centrality of partial and weakly labeled data in statistical learning (Ratner et al., 2017). Whether this be from partial reporting in surveys, or because collecting labeled data is quite expensive, a major challenge in modern machine learning deployments and the release of statistical models is monitoring their performance. The weaker notions of predictive inference and coverage here, we might hope to build more effective and applicable guardrails and uncertainty measures for modern statistical systems, even as they are released to the world. Appendix A. A general upper bound for the greedy approach As we saw in section 3, reasonable conditions on label distributions guarantee that the greedy mappings {Cgr η }η (0,1) solve problem (Cond), while pathologies (as in Example 3) exist. In this section, we show that even in general cases, the sizes of the confidence sets Cgr and Ccond cannot be too far apart. We motivate our approach by noting the similarity between problem (Cond) and the minimum set cover problem familiar in submodular optimization Vazirani (2001); Golovin et al. (2014), which we recall. Let f : 2Y [0, 1] be a monotone submodular coverage function, meaning that for each A B Y and y Y \ B, f satisfies f(A) f(B), f(A {y}) f(A) f(B {y}) f(B), f(Y) = 1, and f( ) = 0. A solution to the minimum set cover problem is C η argmin C Y {|C| s.t. f(C) η} . (19) A classical result combinatorial optimization of Wolsey (1982) bounds the size of the set that a natural greedy algorithm for problem (19) returns. To state the result, we introduce a bit of notation. For any set C Y and y Y, we define (C, y) := f(C {y}) f(C), Predictive Inference with Weak Supervision increase in coverage from adding y to C. At each step j [K], the greedy algorithm chooses yj := argmax y Y ({y1 . . . , yj 1}, y), and stops at the first step j(η) K such that f({y1, . . . , yj(η)}) η. For the greedy set Cgr,j := {y1, . . . , yj}, define the constant Kf,η := min ( η η f(Cgr,j(η) 1), max y Y, j j(η) (Cgr,j,y)>0 ( , y) (Cgr,j, y) , maxy Y ( , y) maxy Y\Cgr,j(η) 1 (Cgr,j(η) 1, y) We then have the following result. Lemma 22 (Wolsey (1982), Theorem 1) Let f : 2Y [0, 1] be a monotone submodular coverage function. Then |Cgr,j(η)| 1 + log Kf,η |C η| Given the apparent similarity between the problems (19) and (Cond), we would like to leverage Lemma 22 to establish a similar guarantee for Alg. 2. To apply Lemma 22 to Alg. 2, we provide the natural analogous quantities, leveraging the notation in the algorithm and working conditional on X = x. Define fx(C) := P(W C = | X = x), which is immediately a submodular coverage function, and for each x we have increment function x(C, y) = P(W C = , y W | X = x). Because the greedy sets Cgr η (x, u) may be randomized but always satisfy Cgr η (x, 1) Cgr η (x, 0), we provide a slight alternative to the constant Kf,η, defining KP,η,x := min ( η η P(W Cgr η (x, 1) = | X = x), max y Y, j j(x,η) x(Cgr,j(x),y)>0 x( , y) x(Cgr,j(x), y) , maxy Y ( , y) maxy Y (Cgr η (x, 1), y) Invoking Lemma 22 and simplifying gives the following result, which bounds the size of the greedy set by a logarithmic quantity times the size of the best (deterministic) covering set. Corollary 23 Let Cgr η : X [0, 1] Y be the confidence set mapping Algorithm 2 outputs. Then for all x X and u [0, 1], |Cgr η (x, u)| |Cgr η (x, 0)| 1 + log KP,η,x min C Y {|C| s.t. P(W C = | x) η} . We can roughly interpret the three terms inside the minimum in (20) as follows. The first term is large when the greedy algorithm nearly attains the required coverage on the iteration Cauchois, Gupta, Ali, and Duchi just before terminating, and therefore measures (in a sense) how wasteful the algorithm is. The second term is large when choosing a label earlier would have improved the coverage more, and so expresses a kind of regret. The third term measures how often the labels y Y co-occur in W. Though the bound is a functional of the discrete derivative x(C, y) and small when the local information in x(C, y) gives good indicators of globally optimal sets C, it can be hard to compute explicitly; we therefore evaluate the size of the sets that Alg. 2 generates for a few experimental examples in Section 5. Appendix B. Proofs of mathematical results B.1 Proofs of lower bounds on confidence set sizes B.1.1 Proof of Theorem 3 Suppose that P is a consistent distribution on X Y 2Y, with marginal Pweak over X 2Y, and consider a procedure b Cn offering 1 α strong distribution-free coverage. Let P be the distribution on X Y 2Y with Pweak = Pweak, and where we define P by the triple ( X, Y , W) P according to Y = argmin y W pn( X, y) := P(Xi,W)n i=1 iid Pweak h y b Cn( X) i . Then, P is a consistent distribution on X Y 2Y, which ensures that P(Xi,Yi,Wi)n+1 i=1 iid P h Yn+1 b Cn(Xn+1) i 1 α. By definition of P, we have P(Xi,Yi,Wi)n+1 i=1 iid P h Yn+1 b Cn(Xn+1) i = E(Xn+1,Yn+1,Wn+1) W [pn(Xn+1, Yn+1)] , the law of b Cn is identical under P or P, as it only depends on (Xi, Wi)n i=1 iid Pweak. On the other hand, we observe that when (Xn+1, Yn+1, Wn+1) P, pn(Xn+1, Yn+1) = inf y Wn+1 pn(Xn+1, y), which guarantees that 1 α E(Xn+1,Yn+1,Wn+1) P [pn(Xn+1, Yn+1)] = E(Xn+1,Yn+1,Wn+1) P inf y Wn+1 pn(Xn+1, y) = E(Xn+1,Wn+1) Pweak inf y Wn+1 pn(Xn+1, y) = E(Xn+1,Wn+1) Pweak inf y Wn+1 pn(Xn+1, y) . Predictive Inference with Weak Supervision B.1.2 Proof of Corollary 8 Consider (X, W) Pweak independent of (Xi, Wi)i 1, and define p(x, y) = P(y C(x)), recalling the definition pn(x, y) = P(y b Cn(x)). Then because for any set W Y and any functions f and g we have inf y W f(y) inf y W g(y) µ(W) inf y W |f(y) g(y)|µ(W) Z W |f(y) g(y)|dµ(y), we obtain by Jensen s inequality and Fubini s theorem that E inf y W p(X, y) inf y W pn(X, y) µ(W) E Z Y |p(X, y) pn(X, y)| dµ(y) 1{y b Cn(X)} 1{y C(X)} dµ(y) = E h µ b Cn(X) C(X) i . Taking the limit as E[µ( b Cn(X) C(X))] 0 as n we thus have lim n E inf y W p(X, y) inf y W pn(X, y) µ(W) = 0. By monotonicity and the assumption that µ(W) > 0 with probability 1, for any ϵ > 0 we may choose c > 0 such that P(µ(W) < c) ϵ, and thus lim n P inf y W p(X, y) inf y W pn(X, y) ϵ lim n P inf y W p(X, y) inf y W pn(X, y) ϵ, µ(W) c + P(µ(W) < c) 0 + ϵ. In particular, | infy W p(X, y) infy W pn(X, y)| p 0, and as pn and p both take values in [0, 1], Theorem 3 s conclusion that E[infy W pn(X, y)] 1 α implies E inf y W p(X, y) 1 α. As infy W p(X, y) [0, 1], this in turn implies P(infy W p(X, y) > 0) 1 α. Finally, we note the following equivalence: a target y W Det C(x)\C(x) if and only if y W {y Y : p(x, y) = 0} if and only if infy W p(x, y) = 0. That is, we have the event equalities {W Det C(x) C(x)} = inf y W p(x, y) > 0 , so that P(W Det C(X) C(X)) = P(infy W p(X, y) > 0) 1 α. B.2 Proofs on size set optimality in weak supervision B.2.1 Proof of Lemma 12 Fix η0 (0, 1). Then y Cη0(x, u) implies that snest(x, y, u) = inf{η | y Cη(x, u)} η0 and so snest(x, y, u) η0. Conversely, assume that snest(x, y, u) η0. Then by definition of snest, y Cη0(x, u) if and only if for all η > η0, we have y Cη(x, u) but y Cη0(x, u), and therefore snest(x, y, u) = η0. But of course, by continuity, P(snest(x, y, U) = η0) = 0, and so P(snest(x, y, U) η0 and y Cη0(x, U)) = 0. Cauchois, Gupta, Ali, and Duchi B.2.2 Proof of Proposition 17 The case where W | X = x has a label-independent structure is immediate, hence we focus on proving the result when W | X = x has a label tree-structure (8). We prove the result by induction on the size of Y , observing that the result is immediate if |Y | = 1. If |Y | = K > 1, we assume that the result holds on sets with at most K 1 elements. We denote Px the law of W | X = x, by Pu the law of U and P = Px Pu their joint distribution, and similarly for their expectations. Fix η (0, 1), and let C : [0, 1] Y be a confidence set mapping satisfying P(C(U) W = ) η. We will prove that Eu|CGreedy η (x, U)| Eu|C(U)|. We use the label ranking y1(x), . . . , y K(x) that Alg. 2 defines, omitting x for simplicity, and consider two cases: Case 1: Pu(y K C(U)) = 0. Then C provides coverage at level η using only the K 1 first labels, which also guarantees that CGreedy η (x, u) only contains labels in {y1, . . . , y K 1} (since, in that case, Jη K 1 in Alg. 2). The induction hypothesis applied to the distribution of W \ {y K} thus ensures that Eu|CGreedy η (x, U)| Eu|C(U)|. Case 2: Pu(y K C(U)) > 0. In that case, we will prove that either Pu(yj C(U)) = 1 for all j [K 1], or that we can build a new confidence set Cfinal(U) such that P(Cfinal(U) W = | X = x) P(C(U) W = | X = x), Eu|Cfinal(x, U)| = Eu|C(U)|, and verifies either Pu(yj Cfinal(U)) = 1 for all j [K 1], or Pu(y K Cfinal(U)) = 0. The distribution Px induces a tree whose leaves are the labels y1, . . . , y K, and each inner node N (apart from the root, which is Y itself) is a subset of Y such that Px(W = N) > 0, and two nodes N1 and N2 share the same parent if any subset C containing strictly either N1 or N2 such that Px(W = C) > 0 contains N1 N2. This parent is then the smallest subset Np such that N1 N2 Np and Px(W = Np) > 0. Each parent is then the union of all its children. Figure 1 provides an example of such a tree. Defining D(C) := {y Y \ {y K} | Pu(yj C(U)) < 1} = , we consider the element y D(C) sharing the lowest common ancestor with y K in the tree. For instance, in Figure 1, if D(C) = {y3, y4}, then y D = y3, as their common ancestor W0 is lower than the common ancestor of y5 and y4 (Y itself). We then proceed to define C(U) from C(U) so that C(U) \ {y K, y D} = C(U) \ {y K, y D} (21) Predictive Inference with Weak Supervision Eu| C(U) {y K, y D}| = Eu|C(U) {y K, y D}|, (22) but now either Pu(y D C(U)) = 1 or Pu(y K C(U)) = 0. In practice, we do so by replacing y K by y D when C(U) {y K, y D} = {y K(x)}, or/and decreasing the probabilities that C(U) {y K, y D} = {y D, y K} and C(U) {y K, y D} = , in such a way that the average size does not vary, but the probability that C(U) {y K, y D} = {y D} increases. We then proceed to check that such a change cannot hurt our coverage, while it evidently leaves the average confidence set size unchanged (because of equations (21) and (22)). The only way we can have C(U) W = and C(U) W = is if W = {y K}. On the other hand, when W = {y D}, we can have C(U) W = but C(U) W = . Because of the definition of y D with respect to y K, any other value of W such that C(U) W = will be such that C(U) W = . In particular, by independence of W and U, we have P( C(U) W = ) P(C(U) W = ) Px(W = {y K(x)}) h Pu(y K C(U)) Pu(y K C(U)) i + Px(W = {y D}) h Pu(y D C(U)) Pu(y D C(U)) i = Pu(y D C(U)) Pu(y D C(U)) (Px(W = {y D}) Px(W = {y K})) , since Pu(y K C(U)) + P(y D C(U)) = Pu(y K C(U)) + P(y D C(U)), as the total average size does not vary. In addition, since y K gets selected last in Alg. 2, we know that for all y Y , Px (W = {y D}) Px (W = {y K})) , which achieves to prove that P( C(U) W = ) P(C(U) W = ) η. If Pu(y D C(U)) = 1, then |D( C)| |D(C)| 1, and we can repeat the process until we obtain a final mapping Cfinal such that either D(Cfinal) = or P(y K(x) Cfinal(U)) = 0. In the first scenario where eventually D(Cfinal) = , it means that Cfinal(U) is either Y or Y \ {y K}, and this is immediate to check that since P(Cfinal(U) W = | X = x) η, we must have Pu(y K Cfinal(U)) Pu(y K CCond-Prox η (x, U)), which in turn ensures that Eu|CCond-Prox η (x, U)| Eu|Cfinal(U)| = E|C(U)|. Cauchois, Gupta, Ali, and Duchi In the second case, since PU(y K Cfinal(U)) = 0, Cfinal is effectively a confidence set over strictly less than K labels, in which case we can apply the induction hypothesis to conclude that Eu|CCond-Prox η (x, U)| Eu|Cfinal(U)| = Eu|C(U)|. B.3 Proofs of algorithmic validity B.3.1 Proof of Lemma 21 We prove the result by proving that if we run Algorithm 3 with M = K!, defining at each step ym j,2 as in equation (11), then at each step m K! of the algorithm, {Ym j }j [m] is a valid partition of Y that satisfies the following conditions: 1. For each j [m], we have Ym j = {ym j } if and only if there exists no i [K 1] such that (i + 1 i) ym j Ym j . 2. If Ym j = {ym j } then s(x, ym j ) s(x, ym j,2). If the partition satisfies these two conditions at every step, then we can run the algorithm until step m = K!, at which point it returns a partition {YK! j }K! k=1 such that s(x, y K! 1 ) s(x, y K! K!). Now, since we have sm j = s K! j for every 1 j m, we conclude that, at each step m [K!], we have s(x, ym 1 ) s(x, ym 2 ) s(x, ym m) min y Y\{ym 1 ,...,ym m} s(x, y), which proves the validity of the algorithm. This is of course true for m = 1, since the best configuration simply ranks rk(x) in decreasing order. 1. By definition of ym ind(m),2 and yind(m), {Ym+1 j }m+1 j=1 is a valid partition of Y such that each ym+1 j Ym+1 j , if {Ym j } is itself a valid partition (and m < K!), so long as we can prove that if Ym j = {ym j }, then there must exist α [K 1] such that (α + 1 α) ym j Ym j (i.e the algorithm does not get stuck and terminates). But this is immediate as, if for all α [K 1], we have (α + 1 α) ym j / Ym j , then it must be by construction that α [K 1] {y Y | y 1(ym j (α)) < y 1(ym j (α + 1))} = {ym j }. Therefore, at each step m K! of the algorithm, {Ym j }j [m] is a valid partition of Y, and the algorithm terminates. 2. On the other hand, it requires more care to justify why, if we set, for all j m, ym j,2 := argmin y Ym j {s(x, y) | α [K 1], y = (α + 1 α) ym j } Predictive Inference with Weak Supervision then we should always have s(x, ym j ) s(x, ym j,2) (23) for all j [m], i.e. why any permutation of the form (α α + 1) ym j that belongs to Ym j cannot strictly decrease the score s(x, y). Equation (23) actually results from a crucial property of the score function (10), which ensures that if s(x, y) < s(x, (α α + 1) y), then it must hold that ry(α)(x) < ry(α+1)(x), i.e. the elements y(α) and y(α + 1) were originally in the wrong order in y. But, since we start the partition process with y1 1 such that ry1 1(1)(x) ry1 1(k)(x), i.e with all elements in the correct order, it is straightforward to check that at any time m, there cannot exist a permutation of the form (α α + 1) ym j that also belongs to Ym j such that rym j (α)(x) < rym j (α+1)(x) : if that were the case, then there would exist l j such that ym j (α) and ym j (α + 1) were the elements exchanged at time l when creating the partition {Yl+1 i }l+1 i=1. In turn, since ym j Ym j , this would guarantee that Ym j {y Y | y 1(ym j (α)) < y 1(ym j (α + 1))}, and thus that (α α + 1) ym j / Ym j . This guarantees that any configuration (α α + 1) ym j Ym j satisfies s(x, (α α + 1) ym j ) s(x, ym j ), and thus that either Ym j = {ym j }, or s(x, ym j,2) s(x, yj), which concludes the proof. Appendix C. Further experiments C.1 Structured prediction example (Perfect matching scores and weak supervision) A matching task consists of optimally pairing elements of a bipartite graph given a feature vector x X. For example, one may wish to identify paired amino acids in protein folding (Taskar et al., 2003). We assume there exists a bipartite graph G with disjoint sets U and V of K 1 nodes; each label Y is then a perfect matching between U and V , i.e., a bijection Y Y = S(U, V ). General supervised approaches for perfect matching problems, such as structured Support Vector Machines (Tsochantaridis et al., 2004) or Adversarial Bipartite Matching (Fathony et al., 2018), generally learn pairwise score functions Cauchois, Gupta, Ali, and Duchi ϕu,v : X R for all (u, v) U V , which measure the cost of adding the edge e := (u, v) for a feature vector x X, and then output a prediction argmin y S(U,V ) u U ϕu,y(u)(x) = X u U,v V 1{v = y(u)} ϕu,v(x) an instance of minimum cost perfect matching solvable in time O(K3) with the Hungarian algorithm. To efficiently adapt this approach in the context of Alg. 1, we assume we have trained a set of pairwise score functions {ϕu,v : X R | (u, v) U V }, (e.g. using supervised training data) and wish to conformalize with partially supervised data, using the score function s Matching(x, y) := X u U,v V 1{v = y(u)} ϕu,v(x, y). (24) In a matching problem, weak supervision can arise under the form of a partial matching between subsets Ui U and Vi V of the nodes, which we write Y weak i S(Ui, Vi): each u Ui has a matching element Y weak i (u) = Yi(u) Vi. Computing the minimum partial scores (5) in Alg. 1 is then computationally efficient, as it reduces to yet another minimum cost perfect matching problem: u Ui ϕu,Y weak i (u)(x) + min y S(U\Ui,V \Vi) u U\Ui v V \Vi 1{v = y(u)} ϕu,v(x) In the matching case, Alg. 3 is equivalent to finding the M-best minimal weight perfect matching in a bipartite graph, which Chegireddy and Hamacher (1987) efficiently solve. In the context of Alg. 3, their approach iteratively chooses an edge em ym ind(m)\ym ind(m),2, then partitions the set of matchings M Ym ind(m) depending on whether em M or not. The computation of each second-best configuration then amounts to solving at most K different perfect matching problems, resulting in an overall O(MK4) cost of the procedure. C.2 Pedestrian tracking with partial matching information We now apply our weakly supervised conformal methods to a bipartite matching problem. A common objective in computer vision, relevant for instance for self-driving cars, is to track people s trajectory throughout different time frames. Since we can leverage powerful algorithms (Redmon et al., 2016) to individually detect objects in every single frame, the problem that we study here is actually a matching task where the goal is to match two sets of people appearing in two separate frames: this is an instance of a maximal matching problem. Weak supervision with partial matchings In this context, we expect partial supervision to come under the form of a partial matching: some people, e.g. people that are easier to track between two consecutive frames because they are in the foreground, already have their match in the second frame, when others, perhaps more difficult to track, are still waiting for a potential match. Given these partially labeled instances, the goal then Predictive Inference with Weak Supervision Strong Weak ETH B vs ETH P/ETH S ETH P vs ETH B/ETH S ETH S vs ETH B/ETH P ETH B vs ETH P/ETH S ETH P vs ETH B/ETH S ETH S vs ETH B/ETH P ETH B vs ETH P/ETH S ETH P vs ETH B/ETH S ETH S vs ETH B/ETH P Train/Validation split Train/Validation split P(Y b C(X)) P(W b C(X) = ) A B Figure 6. Results for the video tracking matching dataset MOT2015 Leal-Taix e et al. (2015), over Ntrials = 20 trials. We use one video sequence for training (ETH-B, ETH-P or ETH-S) and the two others for calibration and testing. A: Strong (1) and weak (3) coverage. B: Average confidence set size for fully supervised (FSC) and weakly supervised conformal (WSC) methods. becomes to return confidence sets of matchings that guarantee 1 α coverage: to provide valid weak coverage (3), we wish to include a configuration that contains at least all the partial matches. Predicting trajectories in the MOT2D15 data set (Leal-Taix e et al., 2015) We experiment using the MOT2D15 pedestrian video tracking dataset (Leal-Taix e et al., 2015). This public benchmark contains short street videos with pedestrians, and the goal is to track each of them while they appear in the frame. Each frame has a set of bounding boxes corresponding to each individual present in the frame, and we seek to match boxes representing the same person between two consecutive frames. Since an individual can enter or exit the frame between two consecutive images, we need to account for potentially unmatched boxes, which we do by including virtual nodes in the bipartite graph, similarly to previous approaches (Kim et al., 2013; Fathony et al., 2018). We use the same feature representation of Kim et al. (2013) and Fathony et al. (2018): given a pair x := (x1, x2) of two consecutive images, and two bounding boxes u x1, v x2, we compute a d = 46 dimensional vector φ(x1, x2, u, v) that summarizes key features (e.g. position of the bounding box, color distribution) and allows determining whether u and v represent the same person. We then train our model using a structured S-SVM approach (Tsochantaridis et al., 2004), following Kim et al. (2013). The model outputs a pairwise score function s PW : (x, u, v) 7 θT φ(x1, x2, u, v) for some vector θ Rd where the feature vector x = (x1, x2) X consists of two consecutive frames, and (u, v) are two potential bounding boxes (one in each image). Experimental set-up and partial labels We use MOT2D15 as follows. For each of the ETH-Bahnhof, ETH-Pedcross2 and ETH-Sunnyday video sequences, which contain respectively 1000, 837, and 354 consecutive images, we select one for training, one for calibration, and the last for testing, using α = 0.02 for conformalization purposes. We introduce weak supervision by assuming that for each pair of images, we observe a partial matching: among the Ki paired individuals, a user provides feedback on Kpartial i iid 1 + Poi(0.5) matches. Cauchois, Gupta, Ali, and Duchi For both our FSC and WSC methods, we use a translated version of the s Matching score (24) with pairwise functions su,v(x) := s PW(x, u, v): for each instance (x, y) X Y, we use the score s Matching(x, y) := s Matching(x, y) min y Y s Matching(x, y) This operation ensures that miny Y s Matching(x, y) = 0 for every x X and thus does not change the ordering of configurations or the score difference between two configurations; we simply use it to place all the instances Xi on the same scale when applying Algorithm 1. In particular, in the noiseless case where the true label Yi is always the minimizer of y 7 s Matching(Xi, y), this guarantees that b C(X) eventually contains a single configuration (as we should, since the score function in this case outputs perfect predictions). Experimental results This specific problem is actually low-noise, as it is possible to achieve a very high accuracy with the S-SVM approach, which is not so surprising as we assume perfect detection of every person thanks the bounding boxes. As a result, we can expect the FSC and WSC methods to output very similar confidence sets, as the configuration minimizing the score is often the true label itself. This is precisely what we observe in Figure 6, where both methods are actually indistinguishable and where, even with a very high confidence 1 α. = 0.98, both the FSC and WSC methods return confidence sets with a single configuration on average. We only notice a slight difference between both methods when training on the ETH-Sunnyday sequence, which contains fewer images, and hence produces slightly worse score functions. C.3 Prediction intervals for weakly supervised regression Much of our development goes beyond (finite) spaces with combinatorial structure. We therefore finish with an exemplar regression problem. We consider predicting the fraction of votes in each United States county for the Democratic Party candidate in the 2016 United States presidential election, using demographic (census) data as covariates and the results of past elections. It is common during elections for forecasters to build predictive models from both census and historical election data, as well as current polling data. We view the historical data as strong supervision (it tells us exactly how many people voted for each candidate), and the polling data as weak supervision (as polls always come with a margin of error). Our goal here is to fit a regression model to the strongly supervised past election data, and then form prediction intervals for the fraction of people in each county who voted Democrat by leveraging the weakly supervised polling data. We hope by combining both we obtain valid intervals narrower than the polling margins of error. Our data comes from the 2013 2017 American Community Survey 5-Year Estimates, a longitudinal survey that records demographic information about each of the 3220 United States counties. We use d = 34 of the available demographic features, and the response is the fraction of people in each county who voted Democratic. To experiment with this dataset, we split it into thirds: 33% of the counties (and their associated fractions of Democratic voters) go into the training set, 33% go into the calibration set, and the rest go into the test set; as our splits are random, they are exchangeable. We fit a Beta regression model to the strongly supervised training data. To simulate the availability of weakly supervised polling Predictive Inference with Weak Supervision data, we replace each calibration set response Yi [0, 1] with a weak response Wi [0, 1], i = 1, . . . , n, by forming intervals Wi = [Yi Zi, Yi + Zi], Zi N(µ, 0.0001), i = 1, . . . , n, (25) for various values of µ {.01, .05, .1, .15, .2}, so that Wi captures fluctuations of roughly µ around Yi. We conformalize by running Alg. 1 with the absolute error score s(x, y) = |ˆy(x) y|, where ˆy(x) [0, 1] denotes the Beta regression model s prediction for the point x Rd. Here, conformalization boils down to solving the simple linear program s(Xi, Yi) = min γ R {|ˆy(Xi) γ| | Yi Zi γ, γ Yi + Zi} , i = 1, . . . , n. Finally, we evaluate both strong (1) and weak (3) coverage on the test set, applying the same transformation (25) to generate the weak labels for the test set. We compute the two types of coverage, as well as the lengths of the prediction intervals, by repeating this process 20 times. We set the miscoverage level α = .05. Similar to our other experiments, Alg. 1 achieves weak coverage at the nominal level .95 for all values of µ (governing the amount of weak supervision), as in the middle panel of Figure 7. We expect the strong coverage to be lower. The left panel of Figure 7 uses compares the strong coverage Alg. 1 achieves in teal, showing the coverage of standard conformal inference (using the correct responses Yi) in pink. Because it provides weak coverage, we expect Alg. 1 to generate shorter prediction intervals than standard conformal inference. The right panel of Figure 7 exhibits this: when µ .1, the average length of Alg. 1 s intervals is at least three times smaller than standard conformal s, and half the length of the average weakly supervised interval Wi from (25) ( .2). We can also see from these figures, as in our other experiments, that Alg. 1 s strong coverage degrades as µ grows, whereas its weak coverage improves and the length of its prediction intervals shrinks. We view these results from a slightly different perspective in Figures 8 and 9. In Figure 8, we show the true fraction of Democratic votes in each county in the test. In Figure 9, we show the lower and upper endpoints of Alg. 1 s prediction intervals, for a randomly chosen repetition with µ = .05. In these two figures, we color the counties with strong (predicted) Democratic majorities blue, and those with strong (predicted) Republican majorities red. By comparing the colors, we can see that the prediction intervals only sometimes contain the true response, which is expected. Finally, we note that the colors of the lower and upper endpoints in Figure 9 are similar, because the length of the prediction intervals is usually small. Cauchois, Gupta, Ali, and Duchi 0.01 0.05 0.1 0.15 0.2 0.01 0.05 0.1 0.15 0.2 0.01 0.05 0.1 0.15 0.2 mu Average Interval Size Figure 7. Results for the regression problem with the election data over 20 trials. Left panel: strong coverage (1). Middle panel weak coverage (3). The dashed red line indicates the nominal coverage level, 1 α = .95. Right panel: prediction interval lengths. In these plots, we show Alg. 1, denoted WSC , in teal. We show standard conformal inference, denoted FSC , in pink. Figure 8. Map of United States counties. We color each county according to the actual fraction votes for the Democratic candidate in the 2016 United States presidential election. We color counties with strong Democratic majorities blue, and those with strong Republican majorities red. We color the counties from the training and calibration sets gray. Predictive Inference with Weak Supervision Figure 9. Map of United States counties. We color each county according to the value of the upper (top panel) and lower (bottom panel) endpoints of the confidence interval that Alg. 1 returns, when µ = .05. We color counties with values close to 1 blue, and those with values close to 0 red. We color the counties from the training and calibration sets gray. Cauchois, Gupta, Ali, and Duchi N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: Ranking and clustering. Journal of the Association for Computing Machinery, 55(5), 2008. A. Angelopoulos, S. Bates, J. Malik, and M. I. Jordan. Uncertainty sets for image classifiers using conformal prediction. ar Xiv:2009.14193 [cs.CV], 2020. R. F. Barber, E. J. Cand es, A. Ramdas, and R. J. Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference, 10(2):455 482, 2021. S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. I. Jordan. Distribution-free, riskcontrolling prediction sets. Journal of the Association for Computing Machinery, 68(6): 43:1 43:34, 2021. M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757 1771, 2004. V. Cabannes, A. Rudi, and F. Bach. Structured prediction with partial labelling through the infimum loss. In Proceedings of the 37th International Conference on Machine Learning, 2020. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129 136, 2007. M. Cauchois, S. Gupta, A. Ali, and J. Duchi. Robust validation: Confident predictions even when distributions shift. ar Xiv:2008.04267 [stat.ML], 2020. M. Cauchois, S. Gupta, and J. Duchi. Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction. Journal of Machine Learning Research, 22(81):1 42, 2021. C. R. Chegireddy and H. W. Hamacher. Algorithms for finding k-best perfect matchings. Discrete Applied Mathematics, 18(2):155 165, 1987. J. Cid-Sueiro, D. Garc ıa-Garc ıa, and R. Santos-Rodr ıguez. Consistency of losses for learning from weak labels. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 197 210, 2014. T. Cour, B. Sapp, and B. Taskar. Learning from partial labels. Journal of Machine Learning Research, 12:1501 -1536, 2011. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Image Net: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248 255, 2009. D. L. Donoho. 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745 766, 2017. Predictive Inference with Weak Supervision J. C. Duchi, L. Mackey, and M. I. Jordan. The asymptotics of ranking algorithms. Annals of Statistics, 41(5):2292 2323, 2013. B.-S. Einbinder, Y. Romano, M. Sesia, and Y. Zhou. Training uncertainty-aware classifiers with conformalized deep learning. In Advances in Neural Information Processing Systems 35, 2022. A. Elisseeffand J. Weston. A kernel method for multi-labeled classification. In Advances in Neural Information Processing Systems 14, 2001. R. Fathony, S. Behpour, X. Zhang, and B. Ziebart. Efficient and consistent adversarial bipartite matching. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1457 1466, 2018. E. R. Fernandes and U. Brefeld. Learning from partially annotated sequences. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 407 422, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. A. Fisch, T. Schuster, T. Jaakkola, and R. Barzilay. Efficient conformal prediction via cascaded inference with expanded admission. In Proceedings of the Ninth International Conference on Learning Representations, 2021. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. Efficient boosting algorithms for combining preferences. Journal of Machine Learning Research, 4:933 969, 2003. S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, E. Orgad, R. Entezari, G. Daras, S. Pratt, V. Ramanujan, Y. Bitton, K. Marathe, S. Mussmann, R. Vencu, M. Cherti, R. Krishna, P. W. Koh, O. Saukh, A. Ratner, S. Song, H. Hajishirzi, A. Farhadi, R. Beaumont, S. Oh, A. Dimakis, J. Jitsev, Y. Carmon, V. Shankar, and L. Schmidt. Data Comp: In search of the next generation of multimodal datasets. In Advances in Neural Information Processing Systems 36, 2023. D. Golovin, A. Krause, and M. Streeter. Online submodular maximization under a matroid constraint with application to learning assignments. ar Xiv:1407.1082 [cs.LG], 2014. C. Gupta, A. K. Kuchibhotla, and A. K. Ramdas. Nested conformal prediction and quantile out-of-bag ensemble methods. Pattern Recognition, 127, 2022. Special Issue on Conformal and Probabilistic Prediction with Applications. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770 778, 2016. Y. Hechtlinger, B. P oczos, and L. Wasserman. Cautious deep learning. ar Xiv:1805.09460 [stat.ML], 2019. E. H ullermeier, J. F urnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16 17):1897 -1916, Nov. 2008. Cauchois, Gupta, Ali, and Duchi J. G. Kemeny. Mathematics without numbers. Daedalus, 88(4):577 591, 1959. M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81 93, 1938. S. Kim, S. Kwak, J. Feyereisl, and B. Han. Online multi-target tracking by large margin structured learning. In Proceedings of the Asian Conference on Computer Vision, pages 98 111, 2013. A. Korba, A. Garcia, and F. d Alch e Buc. A structured prediction approach for label ranking. In Advances in Neural Information Processing Systems 31, pages 8994 9004, 2018. L. Leal-Taix e, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. ar Xiv:1504.01942 [cs.CV], 2015. J. Lei. Classification with confidence. Biometrika, 101(4):755 769, 2014. J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society, Series B, 76(1):71 96, 2014. J. Lei, M. G Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523): 1094 1111, 2018. S. Mayhew, S. Chaturvedi, C.-T. Tsai, and D. Roth. Named entity recognition with partially annotated training data. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 645 655, 01 2019. S. Negahban, S. Oh, and D. Shah. Iterative ranking from pair-wise comparisons. Operations Research, 65(1):266 287, 2016. N. Nguyen and R. Caruana. Classification with partial labels. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 551 559, 2008. G. Papandreou, L. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the International Conference on Computer Vision, pages 1742 1750, 2015. T. Qin and T. Liu. Introducing LETOR 4.0 datasets. ar Xiv:1306.2597 [cs.IR], 2013. A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. R e. Snorkel: rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269 282, 2017. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition, pages 779 788, 2016. Y. Romano, R. Barber, C. Sabatti, and E. J. Candes. With malice towards none: Assessing uncertainty via equalized coverage. ar Xiv:1908.05428 [stat.ME], 2019a. Predictive Inference with Weak Supervision Y. Romano, E. Patterson, and E. J. Cand es. Conformalized quantile regression. In Advances in Neural Information Processing Systems 32, 2019b. Y. Romano, M. Sesia, and E. J. Cand es. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems 33, 2020. M. Sadinle, J. Lei, and L. Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223 234, 2019. D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems 28, 2015. B. Taskar. Learning Structured Prediction Models: A Large Margin Approach. Ph D thesis, Stanford University, 2005. B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in Neural Information Processing Systems 16, 2003. R. J. Tibshirani, R. F. Barber, E. J. Cand es, and A. Ramdas. Conformal prediction under covariate shift. In Advances in Neural Information Processing Systems 32, 2019. B. Triggs and J. Verbeek. Scene segmentation with crfs learned from partially labeled images. In Advances in Neural Information Processing Systems 21, volume 20, pages 1553 1560, 2008. I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004. B. van Rooyen and R. C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18(228):1 50, 2018. A. van Zuylen, R. Hegde, K. Jain, and D. P. Williamson. Deterministic pivoting algorithms for constrained ranking and clustering problems. In Proceedings of the Eighteenth ACMSIAM Symposium on Discrete Algorithms (SODA), pages 405 414. Society for Industrial and Applied Mathematics, 2007. V. V. Vazirani. Approximation Algorithms. Springer, 2001. V. Vovk, A. Grammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, 2005. L. A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2:385 393, 1982. H.-F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 593 601, 2014. Cauchois, Gupta, Ali, and Duchi C. Zhang, C. R e, M. Cafarella, C. De Sa, A. Ratner, J. Shin, F. Wang, and S. Wu. Deep Dive: Declarative knowledge base construction. Communications of the ACM, 60(5):93 102, 2017.