# learning_from_weak_labelers_as_constraints__7ce01f10.pdf Published as a conference paper at ICLR 2025 LEARNING FROM WEAK LABELERS AS CONSTRAINTS Vishwajeet Agrawal*, Rattana Pukdee*, Maria-Florina Balcan, Pradeep Ravikumar Carnegie Mellon University {vishwaja, rpukdee, ninamf, pkr}@cs.cmu.edu We study programmatic weak supervision, where, in contrast to labeled data, we have access to weak labelers, each of which either abstains or provides noisy labels corresponding to any input. Most previous approaches typically employ latent generative models that model the joint distribution of the weak labels and the latent true label. The caveats are that this relies on assumptions that may not always hold in practice, such as conditional independence assumptions over the joint distribution of the weak labelers and the latent true label, and more general implicit inductive biases in the latent generative models. In this work, we consider a more explicit form of side information that can be leveraged to denoise the weak labeler, namely the bounds on the average error of the weak labelers. We then propose a novel but natural weak supervision objective that minimizes a regularization functional subject to satisfying these bounds. This turns out to be a difficult constrained optimization problem due to discontinuous accuracy bound constraints. We provide a continuous optimization formulation for this objective through an alternating minimization algorithm that iteratively computes soft pseudo labels on the unlabeled data satisfying the constraints while being close to the model, and then updates the model on these labels until all the constraints are satisfied. We follow this with a theoretical analysis of this approach and provide insights into its denoising effects in training discriminative models given multiple weak labelers. Finally, we demonstrate the superior performance and robustness of our method on a popular weak supervision benchmark. 1 INTRODUCTION Acquiring high-quality labeled data, which is critical for supervised learning, is often very expensive. The burgeoning field of programmatic weak supervision addresses this caveat by training models from weak labelers, which are functions that either abstain or provide noisy labels corresponding to any input (Zhang et al., 2022). Such weak labelers could range from domain-expertspecified functions to rule-based systems, heuristics, or even pre-trained models. There are two main challenges in learning a predictor using only weak labelers: each weak labeler covers only part of the input space (due to abstention) and it is typically noisy. Thus, one has to generalize beyond the coverage regions, as well as denoise within them. To learn from weak labelers, most prior works draw from developments in related fields such as crowdsourcing (Estell es-Arolas & Gonz alez-Ladr on-de Guevara, 2012; Wazny, 2017), employing latent generative models that model the joint distribution of the weak labels and the latent true label. Conditional inference on the true label then provides a natural approach to aggregate multiple noisy labels into less noisy pseudo-labels which provide an estimate of the true label. However, this relies on the implicit inductive bias since non-parametric mixture models are known to not be identifiable without any further assumptions (Kasahara & Shimotsu, 2014; Aragam et al., 2020). For example, Snorkel (Ratner et al., 2017; 2018) imposes both (a) conditional independence assumptions among the weak labelers and the true label (b) parametric assumptions on the latent generative model. However, these assumptions might not be applicable and are, in general, difficult to validate with domain experts. In addition, the denoising step does not take advantage of modern hypothesis classes such as deep neural networks. Instead, it is common to follow a two-staged approach: denoising step to estimate pseudo-labels and then train a neural network using these pseudo-labels. *Equal contribution. Published as a conference paper at ICLR 2025 We also note that this learning from weak labelers is closely related to a subfield of learning with noisy labels that also faces a denoising problem (Angluin & Laird, 1988; Blum et al., 1998; Natarajan et al., 2013). Here, we assume that a noisy label is obtained by passing the unknown true label through a noise channel that is usually assumed to be independent of the input (Natarajan et al., 2013). There is also a general noise model that could depend on the input called Massart noise (Massart & N ed elec, 2006) which is known to lead to a computationally hard learning problem (Diakonikolas et al., 2019a; 2022a). A recent line of work shows that we can make headway in this difficult setting, provided we have additional side information such as the marginal distribution is structured (Awasthi et al., 2015; 2016; Diakonikolas et al., 2022b). There have been considerable recent advances in learning under such noise models and other difficult noise models such as the malicious noise model (Awasthi et al., 2014; 2017; Diakonikolas et al., 2018; 2019b; Prasad et al., 2018). However, we remark that results from learning with noisy labels are not directly applicable to our setting since the levels of noise induced by our weak labelers do not satisfy the amount of structure needed by these works. Regardless, we believe that programmatic weak supervision can considerably benefit from connections with the field of learning from noisy labels and that it forms a rich area of research. As our first step towards leveraging such a connection, we consider the following natural and explicit inductive bias that domain experts might provide, a bound on the accuracy of each weak labeler (similar to a bound on the fraction of corrupted data in the learning from noisy label setting), where accuracy can either be with respect to the target Bayes optimal classifier or with respect to the underlying stochastic label distribution. Our first contribution is to develop a constraint-based approach to aggregate weak labelers by viewing the accuracy bound as a constraint, and accordingly impose this constraint on the model to be learned. We pose the problem as learning the simplest model (with respect to some regularization functional) that satisfies all of these constraints (Section 2). This objective is however difficult along multiple facets: it is a constrained optimization problem, with discontinuous accuracy bound constraints. While there is a rich literature on continuous convex surrogates of classification accuracy (or error), they are not immediately applicable in our setting since a bound on the accuracy does not translate to a bound on any particular surrogate loss, and recall that it is these accuracy bounds that form our domain-expert provided side information. Our second contribution is a scalable optimization algorithm. This is based on a projection function that projects the model onto the set of all functions that satisfy the constraints. First, we show that the projection objective can be reduced to a linear program when the error bounds are given with respect to the target Bayes optimal classifier (Section 3.1), and a convex optimization problem when they are given with respect to the target label distribution (Section 3.2). The convex optimization problem for the latter case results in an efficient and robust algorithm that can also be used as an approximation when the constraints are given with respect to the target classifier. Our third contribution is a rigorous analysis of our constrained estimator (Section 4). We note that the traditional statistical learning theoretic tools are not applicable since we have no labeled data. We provide two tools to bound the error of the learned classifier: i) an agreement region based analysis ii) corrected triangle inequality. Both approaches involve disparate denoising coefficients that track the benefit of having multiple rather than a single weak labeler. We further refine this to provide a tighter extension of the bounds from the limited support to the overall space using a subtle Lipschitz notion that might be of independent interest even outside the context of this paper. Lastly, we demonstrate the robustness and competitive performance of our method on a popular weak supervision benchmark (Zhang et al., 2021) in Section 5. 2 LEARNING FROM WEAK LABELER CONSTRAINTS Setup: We consider the general setting of multi-class classification, with input space X, output space Y = {1, . . . , K}, joint distribution P over X Y, and marginal distribution over X denoted by PX which we assume to be continuous. We use uppercase letters (e.g., X) to represent random variables and lowercase letters (e.g., x) for deterministic variables. For a vector x Rd, we denote xk as the kth entry of x. Let p (x) = p(y|x) RK be the conditional distribution of y given x, and let f (x) = arg maxk p (x)k denote the Bayes optimal classifier. Our goal is to learn a classifier f : X Y that minimizes the expected misclassification error with respect to the Bayes optimal classifier f , i.e. minimizes Pr(f(X) = f (X)), or the error with respect to the stochastic label Published as a conference paper at ICLR 2025 itself: Pr(f(X) = Y ). In our setting, we assume that we only have access to unlabeled data drawn i.i.d. from PX and weak labelers {gj}j [m], where each weak labeler gj : X Y { } maps any point in the instance space to the label space or abstains from prediction (denoted by ). We denote the coverage set of the weak labeler by Sj := {x X | gj(x) = }. We consider formulating a classifier from a smooth real-valued function h : X RK from some hypothesis class F e.g. neural networks. This function induces a conditional distribution denoted by ph, which we obtain by applying the softmax function, ph(x)k = exp(h(x)k)/ PK i=1 exp(h(x)i). For any conditional distribution p such that arg maxy p(x)y is unique for each x, we denote the corresponding classifier as clf(p) where clf(p)(x) = arg maxk p(x)k. Accuracy bounds and the corresponding constraints: We assume we have additional side information about the average error of each weak labeler gj on its coverage set Sj. First, this can be an error bound with respect to the Bayes optimal classifier f given by ρclf(gj, f ) := Pr X PX(gj(X) = f (X)|X Sj) ηj. (1) Since our target Bayes optimal classifier f only differs from gj by a fraction of at most ηj on Sj, a natural learning strategy is to encourage our hypothesis to also have the same behavior (differ from gj at most ηj). This leads to a constraint ρclf(gj, clf(pf)) ηj. We propose to learn the simplest hypothesis f (with respect to a regularization functional R(f)) that satisfies all of these constraints which can be formalized as the following objective with constraints on classifier outputs: min f F R(f) s.t. ρclf(gj, clf(pf)) ηj, j [m]. (2) On the other hand, if we may instead have access to an error with respect to the stochastic labels sampled from p , the bound is given by ρdist(gj, p ) := Pr X PX,Y p (X) (gj(X) = Y |X Sj) ηj. (3) Here gj is a classifier but Y is a random variable distributed according to p (X), so the probability that gj(x) is not equal to y for each x is given by P k =gj(x) p (x)k, thus ρdist(g, p ) EX PX[P k =gj(x) p (x)k]. Similar to before, we have the following objective with constraints on classifier probabilistic labels: min f F R(f) s.t. ρdist(gj, pf) ηj, j [m] (4) 3 ALGORITHMS In this section, we will first examine an algorithm for solving the learning objective with constraints on classifiers (equation 2) through an alternating minimization algorithm. This requires a projection step that projects a conditional distribution into the set of conditional distributions that satisfy the constraints. We will describe a method for estimating this projection and show that it can be reduced to a linear program. Finally, we will extend our algorithm to solve the objective with constraints on the distribution (equation 4). We provide proofs for results in this section in Appendix B. 3.1 LEARNING OBJECTIVE WITH CONSTRAINTS ON CLASSIFIER The key challenge in solving equation 2 is that it is a constrained optimization problem and these constraints are not continuous. As a key initial step, we define a projection loss of a conditional distribution p to a class of classifiers that satisfies the constraint. We denote a set of all conditional distributions that satisfy the constraints as j Qj,clf, where Qj,clf = {q : X K | ρclf(gj, clf(q)) ηj}. (5) We consider a projection loss from any conditional distribution p to a set of conditional distributions Q given by, d(p, Q) = inf q Q EX[DKL(q(X)||p(X))] (6) where DKL is the KL-divergence. We can show that d(p, Qclf) is zero if and only if p Qclf. Published as a conference paper at ICLR 2025 Lemma 3.1. For any p such that clf(p) is well defined, i.e. arg maxy p(x)y is unique for each x, we have d(p, Qclf) = 0 if and only if p Qclf. From Lemma 3.1, we may replace the constraints in equation 2 with the projection loss instead. This would still lead to the same optimal solution. We have an objective min f F R(f) s.t. d(pf, Qclf) = 0 (7) We consider a relaxation of this objective into an unconstrained optimization problem, min f F R(f) + α d(pf, Qclf) (8) where α trades off satisfying the constraints with a notion of simplicity of distribution specified through the regularization. This relaxed objective is now more suitable for a gradient-based optimization since it is continuous. We can rewrite this objective as min q Qclf,f F L(f, q) := R(f) + αEX[DKL(q(X)||pf(X))] (9) This is a minimization problem in f F, q Qclf for which we propose to use an alternating minimization algorithm where we keep track of a sequence of (ft, qt) and update as follows: 1. ft+1 = arg min f F R(f) + αEX[DKL(qt(X)||pf(X))] (10) 2. qt+1 = arg min q Qclf EX[DKL(q(X)||pft+1(X))] (11) We can show that this algorithm leads to a sequence of (ft, qt) that is decreasing in L(ft, qt): Lemma 3.2. A sequence (ft, qt) with an update in equation 10 and equation 11 satisfies L(ft+1, qt+1) L(ft, qt). Proof. L(ft+1, qt+1) L(ft+1, qt) from minimization in equation 11 and L(ft+1, qt) L(ft, qt) from minimization in equation 10. Solving step 1 (equation 10). It can be seen that the KL term in equation 10 is simply the crossentropy loss with respect to the pseudo soft labels given by q. Thus this first step of the alternating minimization algorithm simply fits a regularized classifier to these pseudo labels. Further, one does not need to estimate the exact minima at each step, as long as the next iterate reduces the objective, i.e. L(ft+1, qt+1) < L(ft, qt). Thus one can take a single or few steps of gradient descent in equation 10 to get ft+1 from ft. Solving step 2 (equation 11). This step of the alternating minimization algorithm computes the soft pseudo labels on the training data so that the constraints are satisfied. This is a much more involved step, which we detail in the following section, where we show that estimating the projection qt can be reduced to a linear program. Before doing so, we first remark that the constraints on classifiers may not be robust to a small perturbation. For example, if one has a uniform conditional distribution punif = (1/K, . . . , 1/K), we can make a small perturbation p of size at most ϵ such that clf(punif + p) = k for any label k. This implies that there exists a conditional distribution p such that d(p, Q) is small but clf(p) can be arbitrary. Lemma 3.3. Let Q = {q X K | clf(q) H} for some set of classifiers H. For any ϵ > 0 and any classifier f, there exists a conditional distribution p such that clf(p) = f and d(p, Q) < ϵ. To alleviate this problem, we enforce an additional constraint by making sure that q puts a higher probability on one label than the rest, Λϵ := {q X K | for each x, y : q(x)y (1 + ϵ)q(x)k for k = y} (12) With this additional constraint, we can show that d(p, Q) bounds the error in satisfaction of the classifier constraints, and further show that the projection problem can be solved efficiently. Proposition 3.4. For any ϵ, ν > 0 there exists δ(ϵ, ν) > 0 such that if d(p, Λϵ Qclf) < δ(ϵ, ν) then ρclf(gj, clf(p)) ηj + ν/|Sj|, where |Sj| = PX(X Sj) is the size of coverage set Sj of weak labeler gj with respect to marginal distribution PX. Published as a conference paper at ICLR 2025 Proposition 3.5. Given Q = Λϵ Qclf for Λϵ defined in equation 12 and Qclf defined in equation 5, for any conditional distribution p, the optimal projection q = arg minq Q E[DKL(q(X)||p(X))] can be found as follows, 1. Solve for optimal clf(q ) as clf(q ) = arg min h X Y EX[Proj KL(p(X); K h(X),ϵ)] : ρclf(gj, h) ηj (13) where K y,ϵ = r K | ry (1 + ϵ)rk , k = y , Proj KL(u; S) = min r S DKL(r||u). (14) 2. Solve for an optimal distribution q given the optimal clf(q ). For each x, q (x) = arg min r DKL(r||p(x)) : r K,ϵ clf(q )(x) (15) A key ingredient is the subset of the simplex K y,ϵ, comprising probabilities that are more confident about the label y with some margin ϵ. For any u K, and label y [K], Proj KL(u; K y,ϵ) is a convex optimization problem since the objective DKL(r||u) is convex in r and the constraints in K y,ϵ are linear. While we can use a black box convex optimization solver, we solve this more efficiently by reparameterizing into an unconstrained objective and using a fast heuristic approximation (see Appendix D.2 for more details). Given c(x, y) = Proj KL(p(x); K y,ϵ) for each x, y, the optimization in step 1 can be framed as an integer linear programming (ILP) problem by using a one-hot representation of the optimizing variable h as µ : X {0, 1}K, with µ(x)y = 1 for y = h(x) and µ(x) = 0 otherwise. The objective then becomes EX[PK y=1 µ(X)yc(X, y)] and the constraint ρclf(gj, h) ηj becomes y =gj(X) µ(X)y] ηj. For each x, we have an additional constraint PK y=1 µ(x)y = 1. Both the objective and constraints are linear in µ, resulting in an ILP. ILPs are usually inefficient to solve, however, we find empirically that the LP relaxation works very well for this class of ILPs (see Appendix D.1 for more details). One issue with solving the ILP is that it can still be slow even with LP relaxation since it involves one constraint for each unlabeled data x. We also can t take a small batch size when using a stochastic gradient descent since the constraints involve the population means of the weak labeler accuracies. Finally, this is not robust to misspecification of the error bounds, as the constraints might become infeasible. In the next section, we propose our method for the learning objective with constraints on distribution (equation 4). Interestingly, we found that it can also be viewed as an approximation of the algorithm presented in this section, yet it overcomes issues of efficiency and robustness, resulting in a more practical solution that is effective in both settings. 3.2 LEARNING OBJECTIVE WITH CONSTRAINTS ON DISTRIBUTION Let Qdist be the set of conditional distributions that satisfy the constraints on distribution, j Qj,dist, where Qj,dist = {q : X K | ρdist(gj, q) ηj}. (16) Now, with a similar proof, we would also have a version of Lemma 3.1 for the constraints on distribution, and further d(p, Qdist) bounds the error in satisfying the constraints without the need of imposing any additional constraint. Proposition 3.6. For Qdist defined in equation 16, we have d(p, Qdist) = 0 if and only if p Qdist. Furthermore, for any ν > 0 there exists δ(ν) > 0 such that if d(p, Qdist) < δ(ν) then ρdist(gj, p) ηj + ν/|Sj|, where |Sj| = PX(Sj) is the size of coverage set Sj of weak labeler gj with respect to marginal distribution PX. This allows us to derive the same alternating minimization algorithm as in equation 10 and equation 11. While the first step is identical, the main benefit of working with the constraints on distribution is that the second projection step can be directly posed as a convex optimization problem. min q:X7 |K| E[DKL(q(X)||p(X))] s.t. E[q(X)gj(X)|x Sj] 1 ηj, j [m]. (17) We can also derive the optimal solution of equation 17. Published as a conference paper at ICLR 2025 Proposition 3.7. Given Qdist as defined in equation 16, for any conditional distribution p, the optimal projection q = arg minq Qdist E[DKL(q(X)||p(X))] is given by q = pλ where pλ(x)y := Zλ(x) exp( X j:gj(x)=y λ)p(x)y (18) Zλ(x) is the normalization constant. The optimal λ is given by the following optimization: λ = min λ 0 j=1 λj s.t. ρdist(gj, pλ) ηj j [m] (19) The optimization for λ can be solved by a coordinate-wise alternating minimization. Given the current iterate λ, pick some j, and update λj as λj = min λ 0 λ s.t. ρdist(gj, pλ) = 1 E[Zλ(x) exp(λ)pλ j(X)gj(X)] ηi (20) where λ j = (λ1, . . . λj 1, 0, λj+1, . . . λm). Note that while optimizing for a particular coordinate λj, we only consider the constraint associated with gj. This can be seen as a relaxation of the true constraint, which is an intersection of m constraints. While this does not ensure satisfaction of all of the constraints as the alternating minimization proceeds, it lends us a simple method to solve this by a Newton-Raphson root-finding method as follows: Reset: λi = 0 Repeat until convergence: λi λi + log(1 + (1 ηi τi(hi))+/τi(hi(1 hi))), hi(x) = pλ(x)gi(x) τi(h) = E[h(X)|X Si] For derivations and proofs, see appendix B. For computational efficiency, we also propose a single parallel update for all λi instead of performing the alternating minimization sequentially by using a single step of the above Newton-Raphson method parallelly across all i given the starting point λ = 0 as: λi = log(1 + (1 ηi τi(pi))+/τi(pi(1 pi))), (pi(x) = p(x)gi(x)). (22) This does not give an exact projection of p on Qdist, but can be thought of as an approximate projection that finds a q that is closer to Qdist than p is to Q, on average. Overall, we summarize our learning algorithm (equation 10, equation 11) with our proposed projection step below, Summary of our learning algorithm. We have an alternating minimization algorithm where we track the pair (ft, qt). Here, ft : X RK is a real-valued function from our hypothesis class (e.g., neural networks), and qt represents its projection onto the set of distributions that satisfy our constraints. At each iteration t, we perform the following steps 1. Estimate qt+1 given ft. (a) Calculate a conditional distribution given by ft with a softmax function σ, pft(x) = softmax(ft(x)) K (23) (b) We use a one-step parallel Newton-Raphson update to calculate λj for each weak labeler j (equation 22). For simplicity, for each unlabeled data xi for i = 1, . . . , n, we denote the gj(xi)th coordinate of pft(xi) as zij := pft(xi)gj(xi) then λj = log(1 + (1 ηj 1 i:gj(xi) = zij)+/( 1 i:gj(xi) = (zij)(1 zij))), (24) cj is the number of data points that gj does not abstain, cj = Pn i=1 1[gj(xi) = ]. (c) Finally, we use λ to calculate qt+1(x). We first compute the unnormalized version of qt+1, qt+1(xi)k = pft(xi)k exp( X j:gj(xi)=k λj). (25) Then qt+1(xi) = qt+1(xi)/ PK k=1 qt+1(xi)k. Published as a conference paper at ICLR 2025 2. Estimate ft+1 given qt+1. This step is equivalent to fitting a ft+1 with respect to the soft pseudo labels given by qt+1. Our loss of ft is given by L(ft; qt+1) = R(ft) + α 1 i=1 ℓCE(ft(xi), qt+1(xi)) where ℓCE is a cross-entropy loss. We then update ft+1 with a gradient descent; ft+1 = ft ν L where ν is a learning rate. We also provide a compact version of our algorithm in Algorithm 1 in Appendix C. Finally, we remark that the sets Qj,dist can also be viewed as a convex approximation of the constraint sets Qj,clf (see Appendix E for more details). Therefore, the method developed in this section is also applicable (as an approximation) for the setting of constraints on classifiers but it is more efficient and also robust to misspecification in error bounds ηj compared to the algorithm in Section 3.1. 4 THEORETICAL ANALYSIS OF THE OBJECTIVE In this section, we provide some theoretical insights into the denoising effect of weak labeler constraints. We focus on the setting of constraints on classifiers (estimation objective equation 2). We learn a conditional distribution pf and thus correspondingly a classifier clf(pf) that minimizes a regularization functional R(f) subjected to weak labeler constraints. Minimizing R(f) implicitly defines a hypothesis class of classifiers as F = {clf(pf) : R(f) < δ} for some δ > 0. Let Hj = {h : X [K] | ρclf(gj, h) ηj} (here Hj is a set of classifiers that differ from Qj which is a set of conditional distributions). We learn a classifier f e F := F T j Hj. The constraints lead to a smaller class e F F , which implies that a typical measure of complexity V of the function class e F such as Rademacher complexity is also smaller than F. Thus, we can quantify the benefit of the weak labeler constraints via the ratio V( e F)/V(F). However, such a function class statistical complexity analysis merely makes the natural qualitative point that we require less labeled data to learn from e F than F. It does not provide a bound per se on the accuracy of our learned classifier, because we are not using any labeled data to select f. 4.1 ERROR GUARANTEE ON AGREEMENT REGION Recall that we minimize R(f) subject to satisfying constraints. We start with the observation that the weak labeler constraint directly restricts the output of a classifier to be similar to that of the weak labeler in its coverage set. For example, given a weak labeler that makes a constant prediction in a small region, any linear separator that satisfies the accuracy bound constraint can only slice off the coverage set by a fraction of at most η (Figure 1 (left)). This leads to an area in the middle where the decision boundary can t pass through: otherwise, the classifier would differ too much from the weak labeler. This resembles the classical notion of agreement regions (Cohn et al., 1994; Balcan et al., 2006), which we thus use to characterize the impact of weak labelers on F. Definition 4.1. For a set of classifiers F, the agreement region is the set of all points where any classifier in F agrees, Agree(F) = {x X | f1, f2 F, f1(x) = f2(x)}. Under realizability, we can show that the error of any f e F in the agreement region is zero. Furthermore, we can show that as multiple constraints are used, the resulting agreement region will always be larger than or equal to the union of agreement regions for individual constraints. In Figure 1, we provide an example with linear classifiers where the agreement regions from multiple constraints are larger than the union of the agreement region from each individual constraint. This is a possible explanation of the implicit denoising effect of multiple weak labelers when learning from a function class F. Formally, we denote an error of a classifier f within the set S as err S(f) := Pr(f(X) = f (X) | S). (26) Lemma 4.2. For any function class F, and corresponding constrained class e F, so long as f e F, for any f e F, err Agree( e F)(f) = 0. Lemma 4.3. For a function class F and constraints H1, H2, we have Agree(F H1) Agree(F H2) Agree(F H1 H2). Published as a conference paper at ICLR 2025 Figure 1: The figure shows the agreement region of the class of linear separators given two weak labeler constraints: one weak labeler labels red in the red circle, and the other labels blue in the blue circle. A linear separator can only differ from each weak labeler by a fraction of η of its support, and therefore the agreement region is given by the middle region of each coverage set (left). The agreement regions from multiple constraints are larger than the union of the agreement region from each individual constraint (right). 4.2 ERROR GUARANTEE ON THE COVERAGE UNION In the previous section, we used the geometry of constrained function classes in general. Here, we note that the specific form of our constraints (weak labelers error bound) naturally leads to an error guarantee via triangle inequality. Lemma 4.4. Let Hj = {h : X [K] | ρclf(gj, h) ηj} then for any f Hj, err Sj(f) ηj + ρclf(gj, f ) (27) We show that it is possible to have a better guarantee than the combination of the error bound from the triangle inequality when we have multiple constraints with conflicting regions. Theorem 4.5. Let H1, . . . , Hk be constraints of the form Hj = {h : X [K] | ρclf(gj, h) ηj}, each with different constant weak labels, gj(X) = j 1 for all j = 1, . . . , k, then for any f T j Hj, we have j=1 (ηj + ρclf(gj, f ))Pr(Sj) σ {1,...,k}, |σ| 2 (2|σ| 3)Pr(Bσ) Pr(S) . (28) when S = Sk j=1 Sj, Bσ = (T i {1,...,k}\σ Sc i ) are minterms. The first term on the RHS is the error guarantee from naively combining multiple error guarantees based on the triangle inequality. The improvement is in the second term which is the region where weak labelers have a conflict (Bσ), more conflict leads to a better error bound. The main idea is that whenever there is a conflict, a classifier can match with only one weak labeler. Therefore, to be approximately close to all weak labelers, it must closely match the other weak labelers in areas outside of the conflict. We note that while our result is applicable to when each weak labeler has a different label if we have multiple weak labelers that predict the same label, we can merge them into a super weak labeler by combining the coverage region e.g. S1 S2 and then derive the new η as a linear combination of η1, η2 e.g. η = η1 Pr(S1)+η2 Pr(S2) Pr(S1)+Pr(S2) , before applying Theorem 4.5. 4.3 EXTENDING ERROR GUARANTEES TO ENTIRE INPUT SPACE The previous two sections provided error guarantees in subsets of the input space: the first over the agreement region of the constrained function class, and the second on the union of the coverage sets of the weak labelers. In this section, we provided a more nuanced analysis based on a smoothness argument which can be used to generalize beyond the coverage set or the agreement region. Intuitively, if f performs well on one region, and the underlying probability distribution is smooth then we would expect it to perform well in a neighbor region as well. Theorem 4.6. Let P be a joint distribution over X Y such that Pr(Y = y | X = x) is L-Lipschitz w.r.t. a metric d where for all k = 1, . . . , K and x1, x2 X, | Pr(Y = k | X = x1) Pr(Y = k | X = x2)| Ld(x1, x2), (29) Published as a conference paper at ICLR 2025 then for any classifier f and a region S, we have Pr(f(X) = Y ) Pr(f(X) = Y | X S)(1 + tf(S)) + Luf(S) (30) The terms tf(S), uf(S) are distance between PX and PX condition on X S, tf(S) = sup k |Pr(f(X) = k | X S) Pr(f(X) = k) Pr(f(X) = k | X S) | (31) k Pr(f(X) = k)Wd(PX|S {x|f(x)=k}, PX|{x|f(x)=k}) (32) and Wd is a Wasserstien distance and we denote PX|A for a distribution PX condition on X A. In this theorem, we analyze an error w.r.t. to y instead of f since we require the smoothness of the underlying data distribution. The generalization bound is small whenever PX is close to PX condition on X S. First, tf(S) is zero whenever Pr(f(X) = k | X S) = Pr(f(X) = k) for all k, that is, the proportion of label k predicted by f is the same in both original distribution and the distribution conditional on S. Second, uf(S) is a bit stronger where it is small when the Wasserstein distance is small which is when S covers areas with high probability mass in X. The smoothness of f is implicit in the term uf(S) in terms of the regions {x | f(x) = k}. We provide full proof in Appendix I. We remark that tf(S), uf(S) only depends on the marginal distribution and, therefore, can be calculated solely from unlabeled data for any classifier f. 5 EXPERIMENTAL EVALUATION Bio CDR Chem IMDB Semeval TREC Yelp Youtube Avg. rank Sup (V) 64.21.1 59.90.6 35.91.4 64.31.3 59.63.0 52.13.7 73.41.4 80.91.2 6.8 MV 56.40.4 68.40.4 52.60.5 75.70.7 65.32.5 37.31.1 69.10.5 82.20.6 5.8 Snorkel 63.40.5 61.20.4 41.81.0 71.30.6 59.21.0 33.41.7 71.80.9 76.81.3 7.8 Lo L(S) 55.50.8 67.40.4 51.60.4 75.30.6 53.42.0 38.72.8 70.61.4 79.10.9 7 Ours (C) 63.50.5 68.50.7 52.20.8 74.31.3 67.52.1 46.72.7 78.60.5 80.71.3 4.6 Ours (V) 62.60.8 67.90.5 53.30.5 72.51.1 79.42.4 63.12.1 75.30.7 90.21.4 4.1 Sup (T) 75.40.8 80.70.5 80.10.2 81.40.6 82.91.9 66.92.3 87.00.3 89.70.6 1.1 Ours (T) 64.80.7 69.10.2 55.60.3 74.81.1 77.31.4 66.32.9 78.30.3 88.00.4 2.8 Oursclf(T) 60.90.9 66.00.4 49.80.7 73.31.0 81.01.9 57.21.2 75.80.5 87.30.3 5.1 Table 1: Average test accuracy and the corresponding standard error (over 5 random train-val-test split of the data) of our proposed algorithm and the baselines. We bold methods that are within a standard error of the best-performing method. Sup (T) and Sup (V) are supervised learning on the training set and the validation set respectively. MV, Snorkel, and Lol(S) are Majority vote, Snorkel, and Losses over labels (Simple) baselines. Ours is our proposed method (Algorithm 1) with different ways to estimate the error bound ηj. Ours (V) is when we estimate ηj from the labels from the validation set, given a Beta(1, 2) prior. Ours (C) is when we use a constant ηj and tune this on the validation set. Ours (T) is when ηj is estimated from the labels from the training set. Finally, Oursclf is our algorithm with respect to constraints on the classifier (Section 3.1). Experiment details: We show a comparison of our proposed method and baselines on 8 text classification datasets from the WRENCH benchmark Zhang et al. (2021). We provide the dataset statistics in Appendix G.1. For all methods and datasets, we use a neural network with a single hidden layer and a hidden size of 16 on top of the pre-trained BERT text embeddings. The neural network is trained with a full batch gradient descent with an Adam optimizer with a learning rate in [0.001, 0.003, 0.01], weight decay in [0.001, 0.003, 0.01] and a number of epochs in range(1, 500, 5). We tune these hyperparameters on the validation set of size 100 for each dataset. For our proposed method, we implement Algorithm 1 with an L2 regularization. Since the WRENCH benchmark does not provide error bounds ηj for each weak labeler, we consider 3 strategies to estimate ηj; i) we estimate ηj from the labels from the validation set, given a Beta(1, 2) prior, ii) we treat ηj as a fixed constant η for all weak labelers and tune this η on the validation set, iii) we estimate ηj from the labels from the training set which we treat as a ground truth error of each weak labeler. Published as a conference paper at ICLR 2025 See Appendix F for more details on estimating error bound from the vaildation set. Finally, we also implement our algorithm with respect to constraints on the classifier (Section 3.1) with a linear programming relaxation of the integer linear program (see Appendix G.3 for more details). Baselines: We compare with three baselines: Majority Vote (MV), Snorkel Metal (Ratner et al., 2017) and Losses over Labels simple (Lo L (S)) (Sam & Kolter, 2023). The majority vote and Snorkel are two-staged approaches where we first aggregate weak labels and then train a neural network on the aggregated labels. The Majority vote simply takes the majority vote between the non-abstain weak labels for each data point while Snorkel deploys a latent generative model to estimate soft labels from the weak labels. On the other hand, Losses over Labels (simple) is a one-staged approach that directly minimizes a weighted linear combination of the loss with respect to each weak labeler. We also compare to a baseline of simply performing supervised learning on labels from the validation set, to avoid concerns regarding possibly large validation sets as raised by Zhu et al. (2023). Results: Our proposed method (Ours (V)) is the best performing approach compared to other baselines (Table 1). This does not solely come from the fact that we have a large number of labels in the validation set since supervised learning on the validation set (with the best hyperparameters) still performs worse than ours on almost every dataset. Surprisingly, our method with a constant ηj (Ours (C)), also performs competitively and is the second best performing approach overall. On the other hand, having access to a more accurate error bound can lead to a performance increase across the board (Ours (T)). In addition, we found that most constraints are satisfied by the models trained by our algorithms (see Table G.2). Ablation on noisy weak labeler error bounds: We assume that the error bound ηj is given as a form of domain knowledge (or estimated from a small validation set). As such, we may have ηj that can be far from the ground truth error. In this section, we explore this impact on the performance of our algorithm and we do so by adding noise to the ground truth ηj (estimated from the training labels), given by ηj = ηj + ϵ, ϵ Uniform( a, a) and use ηj as inputs of our algorithm. We see that our algorithm is quite robust and the performance drops graciously as the noise level increases (Figure 2). 0.0 0.1 0.2 0.3 0.4 0.5 Noise Bioresponse CDR Chemprot IMDB Semeval Trec Yelp Youtube Figure 2: Impact of using noisy error bounds ηj = ηj + ϵ, ϵ Uniform( a, a) for a [0, 0.5]. The accuracy is averaged over 5 random splits of data and the shaded area is the standard error. 6 CONCLUSION In this paper we provided a principled approach to learn from weak labelers - weak classifiers that label a subset of the input space, without making stringent assumptions about dependencies between them, or about their noise model. Instead, we consider the expected error / accuracy as the only source of information that characterizes the weakness of labelers. This information can either be given by domain experts or can be estimated from a small amount of labeled data given only a weak prior on the accuracy. We finally proposed a robust and scalable algorithm that can learn from multiple noisy weak labelers even when only noisy estimates of expected errors are provided. An interesting future direction could be to combine information about dependencies between weak labelers when such information is given along with the information about the expected error, as well as developing alternative techniques based on mixed integer linear programming for solving our constraint-based approach to aggregate weak labelers by viewing the accuracy bound as a constraint. Published as a conference paper at ICLR 2025 ACKNOWLEDGMENTS We acknowledge the support of NSF via IIS-2211907 and IIS-1901403, ONR via N00014-23-12368 and Bloomberg Data Science Fellowship. Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2:343 370, 1988. Chidubem Arachie and Bert Huang. Constrained labeling for weakly supervised learning. Uncertainty in Artificial Intelligence, pp. 236 246, 2021. B Aragam, C Dan, E Xing, and P Ravikumar. Identifiability of nonparametric mixture models and bayes optimal clustering. The annals of statistics, 48(4), 2020. Abhijeet Awasthi, Sabyasachi Ghosh, Rasna Goyal, and Sunita Sarawagi. Learning from rules generalizing labeled exemplars. ar Xiv preprint ar Xiv:2004.06025, 2020. Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pp. 449 458, 2014. Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In Conference on Learning Theory, pp. 167 190. PMLR, 2015. Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Conference on Learning Theory, pp. 152 192. PMLR, 2016. Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM (JACM), 63(6):1 27, 2017. Stephen H Bach, Bryan He, Alexander Ratner, and Christopher R e. Learning the structure of generative models without labeled data. In International Conference on Machine Learning, pp. 273 282. PMLR, 2017. Maria-Florina Balcan and Nika Haghtalab. Noise in classification. https://arxiv.org/pdf/ 2010.05080.pdf, 2020. Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 65 72, 2006. Dimitri P Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014. Avrim Blum, Alan Frieze, Ravi Kannan, and Santosh Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22:35 52, 1998. Paul T Boggs and Jon W Tolle. Sequential quadratic programming. Acta numerica, 4:1 51, 1995. Salva R uhling Cachay, Benedikt Boecking, and Artur Dubrawski. End-to-end weak supervision. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id= gbcsm D3Iznu. Nicolo Cesa-Bianchi, Eli Dichterman, Paul Fischer, Eli Shamir, and Hans Ulrich Simon. Sampleefficient strategies for learning in the presence of noise. Journal of the ACM (JACM), 46(5): 684 719, 1999. Nontawat Charoenphakdee, Jongyeong Lee, and Masashi Sugiyama. On symmetric losses for learning from corrupted labels. In International Conference on Machine Learning, pp. 961 970. PMLR, 2019. Published as a conference paper at ICLR 2025 Mayee F Chen, Daniel Y Fu, Dyah Adila, Michael Zhang, Frederic Sala, Kayvon Fatahalian, and Christopher R e. Shoring up the foundations: Fusing model embeddings and weak supervision. In Uncertainty in Artificial Intelligence, pp. 357 367. PMLR, 2022. Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. ar Xiv preprint ar Xiv:1509.03025, 2015. Edwin KP Chong, Wu-Sheng Lu, and Stanislaw H Zak. An Introduction to Optimization: With Applications to Machine Learning. John Wiley & Sons, 2023. Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pp. 1310 1320. PMLR, 2019. David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15:201 221, 1994. David W Coit, Alice E Smith, and David M Tate. Adaptive penalty methods for genetic optimization of constrained combinatorial problems. INFORMS Journal on Computing, 8(2):173 182, 1996. George B Dantzig. Linear programming. Operations research, 50(1):42 47, 2002. Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical, 2018. Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-independent pac learning of halfspaces with massart noise. Advances in Neural Information Processing Systems, 32, 2019a. Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability, 2019b. Ilias Diakonikolas, Daniel Kane, Pasin Manurangsi, and Lisheng Ren. Cryptographic hardness of learning halfspaces with massart noise. Advances in Neural Information Processing Systems, 35: 3624 3636, 2022a. Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Learning general halfspaces with general massart noise under the gaussian distribution. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pp. 874 885, 2022b. Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pp. 265 284. Springer, 2006. Enrique Estell es-Arolas and Fernando Gonz alez-Ladr on-de Guevara. Towards an integrated crowdsourcing definition. Journal of Information science, 38(2):189 200, 2012. Michel Fortin and Roland Glowinski. Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems. Elsevier, 2000. Marguerite Frank, Philip Wolfe, et al. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95 110, 1956. Daniel Fu, Mayee Chen, Frederic Sala, Sarah Hooper, Kayvon Fatahalian, and Christopher R e. Fast and three-rious: Speeding up weak supervision with triplet methods. In International Conference on Machine Learning, pp. 3280 3291. PMLR, 2020. Kuzman Ganchev, Joao Grac a, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 11:2001 2049, 2010. Philip E Gill, Walter Murray, and Margaret H Wright. Practical optimization. SIAM, 2019. Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems, 31, 2018. Published as a conference paper at ICLR 2025 Bo Han, Gang Niu, Xingrui Yu, Quanming Yao, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Sigua: Forgetting may make learning with noisy labels more robust. In International Conference on Machine Learning, pp. 4006 4016. PMLR, 2020. Earl O Heady, Candler Wilfried, et al. Linear programming methods. Linear programming methods., 1963. Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2410 2420, 2016. Giannis Karamanolakis, Subhabrata Mukherjee, Guoqing Zheng, and Ahmed Hassan Awadallah. Self-training with weak supervision. ar Xiv preprint ar Xiv:2104.05514, 2021. Davood Karimi, Haoran Dou, Simon K Warfield, and Ali Gholipour. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Medical image analysis, 65:101759, 2020. George Em Karniadakis, Ioannis G Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422 440, 2021. Hiroyuki Kasahara and Katsumi Shimotsu. Non-parametric identification and estimation of the number of components in multivariate mixtures. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):97 111, 2014. Michael J Kearns and Umesh Vazirani. An introduction to computational learning theory. MIT press, 1994. Zhaobin Kuang, Chidubem Arachie, Bangyong Liang, Pradyumna Narayana, Giulia De Salvo, Michael Quinn, Bert Huang, Geoffrey Downs, and Yang Yang. Firebolt: Weak supervision under weaker assumptions. In International Conference on Artificial Intelligence and Statistics 2022, 2022. Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2016. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. Pascal Massart and Elodie N ed elec. Risk bounds for statistical learning. 2006. Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6):1 35, 2021. Nagarajan Natarajan, Inderjit S Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Cost-sensitive learning with noisy labels. J. Mach. Learn. Res., 18(1):5666 5698. Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. Advances in neural information processing systems, 26, 2013. Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 2018. URL https://api.semanticscholar.org/Corpus ID: 3614648. Adarsh Prasad, Vishwak Srinivasan, Sivaraman Balakrishnan, and Pradeep Ravikumar. On learning ising models under huber s contamination model. Advances in neural information processing systems, 33:16327 16338, 2020. Rattana Pukdee, Dylan Sam, Pradeep Kumar Ravikumar, and Nina Balcan. Label propagation with weak supervision. In The Eleventh International Conference on Learning Representations, 2022. Published as a conference paper at ICLR 2025 Rattana Pukdee, Dylan Sam, J Zico Kolter, Maria-Florina Balcan, and Pradeep Ravikumar. Learning with explanation constraints. Advances in neural information processing systems, 2023. Alex Ratner, Braden Hancock, Jared Dunnmon, Roger Goldman, and Christopher R e. Snorkel metal: Weak supervision for multi-task learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, pp. 1 4, 2018. Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher R e. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 11, pp. 269. NIH Public Access, 2017. Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, and Christopher R e. Training complex models with multi-task weak supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4763 4771, 2019. Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher R e. Data programming: Creating large training sets, quickly. Advances in neural information processing systems, 29, 2016. R Tyrrell Rockafellar. Lagrange multipliers and optimality. SIAM review, 35(2):183 238, 1993. Dylan Sam and J. Zico Kolter. Losses over labels: Weakly supervised learning via direct loss construction, 2023. Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022. Paroma Varma, Bryan D He, Payal Bajaj, Nishith Khandwala, Imon Banerjee, Daniel Rubin, and Christopher R e. Inferring generative model structure with static analysis. Advances in neural information processing systems, 30, 2017. Kerri Wazny. crowdsourcing ten years in: A review. Journal of global health, 7(2), 2017. Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in label-noise learning? Advances in neural information processing systems, 32, 2019. Xiaobo Xia, Tongliang Liu, Bo Han, Nannan Wang, Mingming Gong, Haifeng Liu, Gang Niu, Dacheng Tao, and Masashi Sugiyama. Part-dependent label noise: Towards instance-dependent label noise. Advances in Neural Information Processing Systems, 33:7597 7610, 2020. Ozg ur Yeniay. Penalty function methods for constrained optimization with genetic algorithms. Mathematical and computational Applications, 10(1):45 56, 2005. Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In International Conference on Machine Learning, pp. 7164 7173. PMLR, 2019. Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International conference on machine learning, pp. 325 333. PMLR, 2013. Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner. Wrench: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. Jieyu Zhang, Cheng-Yu Hsieh, Yue Yu, Chao Zhang, and Alexander Ratner. A survey on programmatic weak supervision. ar Xiv preprint ar Xiv:2202.05433, 2022. Jieyu Zhang, Linxin Song, and Alex Ratner. Leveraging instance features for label aggregation in programmatic weak supervision. In International Conference on Artificial Intelligence and Statistics, pp. 157 171. PMLR, 2023. Published as a conference paper at ICLR 2025 Wen Li Zhao, Pierre Gentine, Markus Reichstein, Yao Zhang, Sha Zhou, Yeqiang Wen, Changjie Lin, Xi Li, and Guo Yu Qiu. Physics-constrained machine learning of evapotranspiration. Geophysical Research Letters, 46(24):14496 14507, 2019. Dawei Zhu, Xiaoyu Shen, Marius Mosbach, Andreas Stephan, and Dietrich Klakow. Weaker than you think: A critical look at weakly supervised learning, 2023. Yinhao Zhu, Nicholas Zabaras, Phaedon-Stelios Koutsourelakis, and Paris Perdikaris. Physicsconstrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics, 394:56 81, 2019. Published as a conference paper at ICLR 2025 A ADDITIONAL RELATED WORK A.1 PROGRAMMATIC WEAK SUPERVISION. Programmatic weak supervision Ratner et al. (2016), is a learning paradigm where subject matter expert encodes their domain knowledge in terms of weak labelers (labeling functions). These are functions that either provide labels (weak labels) or abstain from prediction on any input. The key question is how to learn from such weak labels that are potentially noisy. A general approach involves two steps of aggregating weak labels into less noisy labels and then training a discriminative classifier on the inferred labels. Various statistical label models have been developed for label aggregation, and most of these only depend on the weak labels Ratner et al. (2016; 2018; 2019); Bach et al. (2017); Varma et al. (2017); Fu et al. (2020); Kuang et al. (2022). Recent methods incorporate the input information into this weakly supervised learning pipeline Chen et al. (2022); Pukdee et al. (2022); Zhang et al. (2023) e.g. smoothness of the label with respect to the input instance or pretrained-embedding. Alternatively, there is a line of works that learns from weak labels in an end-to-end fashion. The label model and the discriminative model are parameterized and are trained jointly Cachay et al. (2021); Karamanolakis et al. (2021); Awasthi et al. (2020); Sam & Kolter (2023). A key prior work is constrained label learning Arachie & Huang (2021), which involves learning pseudo labels under constraints before training a discriminative model. Unlike our method, CLL s initial step is independent of the function class and remains a two-step process. A.2 LEARNING WITH NOISY LABELS Learning from data with noisy labels has been a prominent research problem that has received substantial attention from the machine learning community. The main assumption is that the observed labels are flipped with some probability. There is a line of works in theoretical machine learning that focuses on the learnability and polynomial-time algorithms for various noise models Kearns & Vazirani (1994); Cesa-Bianchi et al. (1999); Angluin & Laird (1988); Awasthi et al. (2014; 2015); Diakonikolas et al. (2019a); Balcan & Haghtalab (2020). On the other hand, in the statistical learning theory community, various works are concerned with different surrogate losses or unbiased estimators Natarajan et al. (2013); Prasad et al. (2020); Natarajan et al.; Charoenphakdee et al. (2019); Xia et al. (2019; 2020). Recently, numerous studies have focused on learning noisy labels on deep neural networks Song et al. (2022); Karimi et al. (2020); Yu et al. (2019); Han et al. (2018; 2020). Despite rich theoretical and empirical research in this area, these approaches cannot be employed in our context since the noisy labels in our setting are deterministic given the input and label. A.3 LEARNING FROM CONSTRAINTS Constrained optimization focuses on finding the optimal solution to a problem within given constraints Bertsekas (2014); Gill et al. (2019); Chong et al. (2023). The principal method involves converting the constrained optimization problem into an unconstrained one. This can be accomplished by adding auxiliary variables, as seen in the Lagrange multiplier method Rockafellar (1993), or by incorporating a penalty term that penalizes the objective function when constraints are violated Coit et al. (1996); Fortin & Glowinski (2000); Yeniay (2005). Additionally, gradient-based approaches like PGD include a projection step to ensure the solution stays within a feasible region after each update Chen & Wainwright (2015); Madry et al. (2018). For specific types of constraints such as linear or quadratic, there are dedicated algorithms designed to address these constrained scenarios Heady et al. (1963); Dantzig (2002); Frank et al. (1956); Boggs & Tolle (1995). In the field of machine learning, the optimization goal is the empirical loss, and constraints can be used to enforce desirable features in a machine learning model, such as fairness Zemel et al. (2013); Mehrabi et al. (2021), robustness Cohen et al. (2019); Kurakin et al. (2016), or privacy Dwork et al. (2006). Recent studies have begun incorporating constraints derived from domain knowledge into models, such as physics constraints Zhao et al. (2019); Zhu et al. (2019); Karniadakis et al. (2021), explanation constraints Pukdee et al. (2023), and weak supervision Ganchev et al. (2010); Hu et al. (2016), with the aim of improving model performance. Published as a conference paper at ICLR 2025 B PROOFS OF PROPERTIES OF PROJECTION LOSS AND COMPUTING THE PROJECTION Lemma B.1. For Qclf defined in equation 5, i.e. j Qj,clf, where Qj,clf = {q : X K | ρclf(gj, clf(q)) ηj}. (33) where ρclf(gj, clf(q)) = E[gj(X) = clf(q)(X)|gj(X) = ϕ] (34) and clf(q) is defined as clf(q)(x) = arg max y q(x)y (35) (assuming arg maxy q(x)y is unique. If arg maxy q(x)y is not unique for some x then clf(q) is not well defined. ) then for any p such that clf(p) is well defined, we have d(p, Qclf) = infq Qclf E[DKL(p(X)||q(X))] = 0 if and only if p Qclf. Proof. If p Qclf, we would have d(p, Qclf) E[DKL(p(X)||p(X))] = 0. Conversely when d(p, Qclf) = 0 we prove that p Qclf i.e. ρclf(gj, p) ηj for all j [m]. First, we prove that inf q Qclf E[d T V (p(X), q(X))] = 0 (36) where d T V is defined as d T V (u, v) = 1/2 k=1 |uk vk| (37) Using Pinsker s inequality, d T V (u, v) p DKL(u||v)/2 (38) we have E[d T V (p(X), q(X))] = E[d T V (p(X), q(X))] (39) DKL(p(X)||q(X))/2] (40) E[DKL(p(X)||q(X))]/2 (from Jenson s) (41) Thus inf q Qclf E[d T V (p(X), q(X))] inf q Qclf E[DKL(p(X)||q(X))]/2 (42) inf q Qclf E[DKL(p(X)||q(X))]/2 (43) 0 (44) Let us say p Qclf, there exists some ϵ > 0 such that ρclf(gj, p) ηj + ϵ for some j [m]. Pr(gj(X) = clf(p)(X)|X Sj) ηj + ϵ (45) where Sj = {x : gj(x) = ϕ} is the coverage set of gj. Let S Sj defined as S = {x : gj(x) = clf(p)(x)}. Denoting |S| = Pr(X S), since |S |/|Sj| ηj + ϵ, and assuming |Sj| > 0, we have |S | > 0. Now for any q Qclf, we have Pr(clf(q)(X) = gj(X)|X Sj) ηj (46) Then Pr(clf(q)(X) = gj(X)|X Sj) = Pr(clf(q)(X) = gj(X)|X S , X Sj)P(X S |X Sj) (47) = Pr(clf(q)(X) = gj(X)|X S )P(X S , X Sj)/P(X Sj) (48) = Pr(clf(q)(X) = gj(X)|X S )P(X S )/P(X Sj) (49) = Pr(clf(q)(X) = gj(X)|X S )|S |/|Sj| (50) Published as a conference paper at ICLR 2025 Pr(clf(q)(X) = gj(X)|X S ) = Pr(clf(q)(X) = gj(X)|X Sj)|Sj|/|S | (51) ηj|Sj|/|S | (52) ηj/(ηj + ϵ) (53) Pr(clf(q)(X) = gj(X)|X S ) = 1 Pr(clf(q)(X) = gj(X)|X S ) (54) 1 ηj/(ηj + ϵ) (55) ϵ/(ηj + ϵ) (56) Since clf(p)(x) = gj(x) for x S , clf(q)(x) = gj(x) implies clf(p)(x) = clf(q)(x). Thus Pr(clf(q)(X) = clf(p)(X)|X S ) ϵ/(ηj + ϵ) (57) Let S = {x S : clf(q)(x) = clf(p)(x)}, we have Pr(X S |X S ) ϵ/(ηj + ϵ) (58) Pr(X S |X S )P(X S ) = P(X S , X S ) = P(X S ) (59) Pr(X S |X S ) = P(X S )/P(X S ) (60) P(X S )/P(X S ) ϵ/(ηj + ϵ) (61) |S | |S |ϵ/(ηj + ϵ) > 0 (62) Now let µ(x) = max k =clf(p)(x) p(x)clf(p)(x) p(x)k (63) For clf(p) to be well defined, µ(x) must be greater than 0, i.e. µ(x) > 0 for all x. Since clf(p)(x) = clf(q)(x) for x S , we have for any x S p(x)clf(p)(x) p(x)clf(q)(x) + µ(x) (64) q(x)clf(q)(x) q(x)clf(p)(x) (65) p(x)clf(p)(x) p(x)clf(q)(x) + µ(x) (66) q(x)clf(p)(x) q(x)clf(q)(x) (67) p(x)clf(p)(x) q(x)clf(p)(x) p(x)clf(q)(x) q(x)clf(q)(x) + µ(x) (68) (p(x)clf(p)(x) q(x)clf(p)(x))+(q(x)clf(q)(x) p(x)clf(q)(x)) µ(x) (69) |p(x)clf(p)(x) q(x)clf(p)(x)|+|p(x)clf(q)(x) q(x)clf(q)(x)| µ(x) (70) d T V (p(x), q(x)) µ(x)/2 (71) E[d T V (p(X), q(X))|X S ] E[µ(X)/2|X S ] (72) Since infq Qclf E[d T V (p(X), q(X))] = 0, there exists some q Qclf such that E[d T V (p(X), q(X))] < |S|E[µ(X)/2|X S] (73) for any set S with |S| > 0. In particular this holds for S = S since |S | > 0. Then E[d T V (p(X), q(X))] E[d T V (p(X), q(X))I(X S )] (74) = E[d T V (p(X), q(X))|X S ] Pr(X S ) (75) = E[d T V (p(X), q(X))|X S ]|S | (76) E[d T V (p(X), q(X)|X S )]|S | < |S |E[µ(X)/2|X S ] (77) E[d T V (p(X), q(X)|X S )] < E[µ(X)/2|X S ] (78) which leads to a contradiction. Published as a conference paper at ICLR 2025 Lemma B.2. Let Q = {q : clf(q) H} for some set H of classifiers. For any classifier f and any ϵ > 0, there exists a conditional distribution p such that clf(p) = f and d(p, Q) < ϵ. Proof. We claim that for any classifiers f1, f2 there exists sequences of conditional distributions {qt 1} t=1, {qt 2} t=1 such that clf(qt 1) = f1, clf(qt 2) = f2 for all t, and limt qt 1 = limt qt 2. The main idea is that we can slightly perturb a uniform distribution to achieve any classifier. Formally, we define qt 1(x)k = 1/K δ/((K 1) t) for k = f1(x) and qt 1(x)k = 1/K + δ/t for k = f1(x), where K = |Y| is the number of classes. We can define qt 2 in the same manner. It s clear that both sequence converges to a uniform distribution, that is, limt qt 1(x)y = limt qt 2(x)y = 1/K for all x, y even when f1, f2 are completely different. Now for any classifier h H and any classifier f, there exists a sequence {qt h}, {qt f} such that clf(qt h) = h, clf(qt f) = f for all t and that limt qt h = limt qt f. Since clf(qt h) = h, we know that qt h Q and this implies that lim t d(qt f, Q) = lim t min q Q E[DKL(q(X)||qt f(X))] (79) lim t E[DKL(qt h(X)||qt f(X))] = 0. (80) The last line comes from the fact that limt qt h = limt qt f. As a result, we can conclude that for any ϵ, there exists some t for which p = qt h satisfies d(p, Q) < ϵ and clf(p) = f. Proposition B.3. For any ϵ, ν > 0 there exists δ(ϵ, ν) > 0 such that if d(p, Λϵ Qclf) < δ(ϵ, ν) then ρclf(gj, clf(p)) ηj + ν/|Sj|, where |Sj| = PX(X Sj) is the size of coverage set Sj of weak labeler gj with respect to marginal distribution PX. In particular δ(ϵ, ν) = νL(ϵ) (1+ν) where L(ϵ) is a strictly increasing function in ϵ given by L(ϵ) = (1 + ϵ) log(1 + ϵ) (2 + ϵ) log(1 + ϵ/2) where K is the number of classes. The bound can be rephrased as the following inequality: For all p such that d(p, Qϵ clf) < L(ϵ), we have ρclf(gj, clf(p)) ηj + d(p, Qϵ clf) (L(ϵ) d(p, Qϵ clf))|Sj| (82) where Qϵ clf = Λϵ Qclf. Proof. First, we prove the following ρclf(gj, clf(p)) ηj + min q Qclf dclf(p, q)/|Sj| (83) where dclf(p, q) is defined as dclf(p, q) = Pr X (clf(p)(X) = clf(q)(X)) (84) ρclf(gj, clf(p)) = Pr X (clf(p)(X) = gj(X)|X Sj) Pr X (clf(p)(X) = clf(q)(X)|X Sj) + Pr X (clf(q)(X) = gj(X)|X Sj) for all q ρclf(gj, clf(p)) min q Qclf Pr X (clf(p)(X) = clf(q)(X)|X Sj) + Pr X (clf(q)(X) = gj(X)|X Sj) min q Qclf Pr X (clf(p)(X) = clf(q)(X)|X Sj) + max q Qclf Pr X (clf(q)(X) = gj(X)|X Sj) min q Qclf Pr X (clf(p)(X) = clf(q)(X)|X Sj) + ηj min q Qclf Pr X (clf(p)(X) = clf(q)(X), X Sj)/|Sj| + ηj min q Qclf Pr X (clf(p)(X) = clf(q)(X))/|Sj| + ηj = ηj + min q Qclf dclf(p, q)/|Sj| Published as a conference paper at ICLR 2025 Next, we will prove the bound on infq Qclf dclf(p, q), i.e. inf q Qclf dclf(p, q) d(p, Qϵ clf) (L(ϵ) d(p, Qϵ clf)) (85) For any conditional distributions p, q X K define the following dkl(p, q) = E[DKL(q(X)||p(X))] (86) d+ kl(p, q) = E[DKL(q(X)||p(X))| clf(p)(X) = clf(q)(X)] (87) d kl(p, q) = E[DKL(q(X)||p(X))| clf(p)(X) = clf(q)(X)] (88) Consider the following sublemma. Sublemma B.3.1. For any q Qϵ clf, if dkl(p, q) < L(ϵ) then dclf(p, q) dkl(p, q) L(ϵ) dkl(p, q) (89) where L(ϵ) = inf q Qϵ clf,p d kl(p, q) (90) The bound in equation 85 directly follows from the above sublemma. This is proved as follows. inf q Qclf dclf(p, q) = inf q Qϵ clf dclf(p, q) inf q Qϵ clf dkl(p, q) L(ϵ) dkl(p, q) infq Qϵ clf dkl(p, q) supq Qϵ clf L(ϵ) dkl(p, q) = infq Qϵ clf dkl(p, q) L(ϵ) infq Qϵ clf dkl(p, q) = d(p, Qϵ clf) L(ϵ) d(p, Qϵ clf) Finally d(p, Qϵ clf) νL(ϵ) (1+ν) gives us infq Qclf dclf(p, q) ν, and thus ρclf(gj, clf(p)) ν/|Sj|+ηj. Now we proof sublemma B.3.1. First note that, dkl(p, q) = dclf(p, q)d kl(p, q) + (1 dclf(p, q))d+ kl(p, q) (91) d+ kl(p, q) = dkl(p, q) dclf(p, q)d kl(p, q) 1 dclf(p, q) dkl(p, q) dclf(p, q)L(ϵ) 1 dclf(p, q) dkl(p, q) + (dkl(p, q) L(ϵ)) dclf(p, q) 1 dclf(p, q) < dkl(p, q) dclf(p, q) = dkl(p, q) d+ kl(p, q) d kl(p, q) d+ kl(p, q) dkl(p, q) d kl(p, q) d+ kl(p, q) < dkl(p, q) L(ϵ) dkl(p, q) Published as a conference paper at ICLR 2025 Now we are only left to prove the form of L(ϵ) given in equation 81. Recall Λϵ was defined in equation 12 as Λϵ := {q : X k | for each x, y : q(x)y (1 + ϵ)q(x)k for k = y} (92) For a distribution u K, let cj(u) refer to the top j th class, i.e. c1(u), . . . c K(u) is a permutation of [K] s.t. uc1(u) . . . uc K(u). Let tj(u) refer to the corresponding probability i.e. tj(u) = ucj(u). Let λϵ = {u K : t1(u) (1 + ϵ)tj(u) j > 1}. This is connected to Λϵ as Λϵ = {q : X K : x, q(x) λϵ} L(ϵ) = inf q Qϵ clf,p d kl(p, q) (93) = inf q Qϵ clf,p E[DKL(q(X)||p(X))| clf(p)(X) = clf(q)(X)] (94) = inf h H,p E[ inf u λϵ,c1(u)=h(X) DKL(u||p(X))| clf(p)(X) = c1(u)] (95) (where H = {h : ρclf(gj, h) ηj}) (96) = inf h H E[ inf u λϵ:c1(u)=h(X),v K:c1(v) =c1(u) DKL(u||v)] (97) = inf u λϵ,v K:c1(v) =c1(u) DKL(u||v) (98) = inf u,v K DKL(u||v) : c1(v) = c1(u), t1(u) (1 + ϵ)tj(u) j > 1 (99) Now without loss of generality, let cj(u) = j. With some abuse of notation, we also use u, v to refer to the minimizer of objective in equation 99. The minimizer exists since the constraints induce closed sets. Then first we will prove that cj(v) = cj(u) = j for j 3 (100) Assume the contrary. Then there exists k 3 such that vk > v2. This implies (u2 uk) log(vk/v2) > 0 (101) u2 log(1/v2) + uk log(1/vk) > u2 log(1/vk) + uk log(1/v2) (102) u2 log(u2/v2) + uk log(uk/vk) > u2 log(u2/vk) + uk log(uk/v2) (103) DKL(u||v) > DKL(u||v ) (104) where v is defined as v 2 = vk, v k = v2, v j = vj for j = k, 2 (105) But u, v was the minimizer of the objective in equation 99, which contradicts the above implication in equation 104. Since c1(u) = 1 = c1(v), we get c1(v) = 2, and c2(v) = 1. To recap we have cj(u) = j (106) c1(v) = 2, c2(v) = 1, cj(v) = j for j 3 (107) Next we will prove that v1 = v2 (108) Let x = (v2 v1)/2 (109) y = (v2 + v1)/2 (110) Since c1(v) = 2, v2 v1 and so x, y 0. We have v1 = y x (111) v2 = y + x (112) j=3 vj = 1 2y (113) Published as a conference paper at ICLR 2025 L(ϵ) = u1 log(u1/y x) + u2 log(u2/y + x) + j=3 uj log(uj/vj) x L(ϵ) = u1/(y x) u2(y + x) = u1(y + x) u2(y x) = x(u1 + u2) + y(u1 u2) Thus L(ϵ) is minimized for x = 0. Next, we have the following inequality i=1 ai log(ai/bi) ( i=1 ai) log( i=1 bi) (114) This follows Jenson s inequality, given a discrete distribution f : Pn i=1 fi = 1, we have i=1 fi log(ti) log( Substituting fi = ai/ Pn i=1 ai, and ti = bi/ai, we get the required inequality. We use this inequality to have K X j=3 uj log(uj/vj) ( j=3 uj) log( j=3 vj) (115) This gives us L(ϵ) u1 log(u1/v1) + u2 log(u2/v2) + (1 u1 u2) log 1 u1 u2 We had v1 = v2. So, substituting v1 = v2 = y, we have L(ϵ) f(y) = u1 log u1/y + u2 log u2/y + (1 u1 u2) log 1 u1 u2 yf(y) = u1/y u2/y + 2(1 u1 u2) 2 yf(y) = 2(1 u1 u2) (1 2y)2 > 0 Thus f(y) is minimized for yf(y) = 0, giving y = v1 = v2 = (u1 + u2)/2. Substituting in equation 116, we get L(ϵ) u1 log(2u1/(u1 + u2)) + u2 log(2u2/(u1 + u2)) (117) Let x = (u1 u2)/2 (and thus u1 = y + x, u2 = y x) we have L(ϵ) f(x) = (y + x) log(1 + x/y) + (y x) log(1 x/y) xf(x) = log(1 + x/y) log(1 x/y) > 0 Since u1 (1 + ϵ)u2, we have y + x (1 + ϵ)(y x), thus f(x) is minimized for x = yϵ/(2 + ϵ). Using y = (u1 + u2)/2, we have L(ϵ) (u1 + u2) ((1 + ϵ)/(2 + ϵ) log(2(1 + ϵ)/(2 + ϵ)) + 2y/(2 + ϵ) log(2/(2 + ϵ))) ((1 + ϵ) log(1 + ϵ) (2 + ϵ) log(1 + ϵ/2)) Published as a conference paper at ICLR 2025 Now to bound u1 + u2, consider the following. Since uj u2 for j 3, we have j=2 uj = 1 u1 u1 + u2 1 + u1(K 2) Further since u1 (1 + ϵ)uj for j 2, we have, (K 1)u1 (1 + ϵ) u1 (1 + ϵ)(1 u1) Substituting back we get u1 + u2 2 + ϵ Substituting in the bound for L(ϵ) we finally get L(ϵ) (1 + ϵ) log(1 + ϵ) (2 + ϵ) log(1 + ϵ/2) K + ϵ (118) Now to prove L(ϵ) is strictly increasing and positive for ϵ > 0, we have L(0) = 0 and ϵL(ϵ) = ϵf(ϵ)/(K + ϵ) f(ϵ)/(K + ϵ)2 = (K 1) log(1 + ϵ) (K 2) log(1 + ϵ/2) Proposition B.4. Given Q = Λϵ Qclf for Λϵ defined in equation 12 and Qclf defined in equation 5, for any conditional distribution p, the optimal projection q = arg infq Q E[DKL(q(X)||p(X))] can be found as follows, 1. Solve for optimal clf(q ) as clf(q ) = arg min h X Y EX[Proj KL(p(X); K h(X),ϵ)] : ρclf(gj, h) ηj (119) K y,ϵ = r K | ry (1 + ϵ)rk , k = y , Proj KL(u; S) = inf r S DKL(r||u). (120) 2. Solve for an optimal distribution q given the optimal clf(q ). For each x, q (x) = arg min r DKL(r||p(x)) : r K,ϵ clf(q )(x) (121) Published as a conference paper at ICLR 2025 Proof. Given Λϵ = {q X K | x, q(x) SK k=1 K k,ϵ}. We have q = arg min q Q E[DKL(q(X)||p(X))] = arg min q Λϵ E[DKL(q(X)||p(X))] : q Qclf = arg min q Λϵ E[DKL(q(X)||p(X))] : ρ(gj, clf(q)) ηj j [m] = arg min q Λϵ min h X Y E[DKL(q(X)||p(X))] : ρ(gj, h) ηj, h = clf(q) = arg min q:q=q min q Λϵ min h X Y E[DKL(q (X)||p(X))] : ρ(gj, h) ηj, h = clf(q ) = arg min q:q=q min h X Y min q Λϵ E[DKL(q (X)||p(X))] : ρ(gj, h) ηj, h = clf(q ) = arg min q:q=q min h X Y min q Λϵ E[DKL(q (X)||p(X))] : clf(q ) = h : ρ(gj, h) ηj = arg min q:q(x)=ux min h X Y E min u X K DKL(u X||p(X)) : u X K h(X),ϵ) : ρ(gj, h) ηj q (x) = arg min u K DKL(u||p(X)) : u K h (X),ϵ (122) h = arg min h X Y E min u X K DKL(u X||p(X)) : u X K h(X),ϵ) : ρ(gj, h) ηj (123) Since u x K h (x),ϵ, it implies clf(u ) = h , and since q (x) = u x, clf(q ) = clf(u ) = h . Thus clf(q ) = min h X Y E[Proj KL(p(X); K h(X),ϵ)] : ρ(gj, h) ηj (124) q (x) = arg min u K DKL(u||p(X)) : u K clf(q )(X),ϵ (125) Proposition B.5. For Qdist defined in equation 16, we have d(p, Qdist) = 0 if and only if p Qdist. Furthermore, for any ν > 0 there exists δ(ν) > 0 such that if d(p, Qdist) < δ(ν) then ρdist(gj, p) ηj + ν/|Sj|, where |Sj| = PX(Sj) is the size of coverage set Sj of weak labeler gj with respect to marginal distribution PX. In particular δ(ν) = ν2/2. Rephrasing we get the following bound: ρdist(gj, p) ηj + p 2d(p, Qdist)/|Sj| Proof. Define E(p) = minq Qdist EX[d T V (p(X), q(X))], where d T V is the TV distance: d T V (u, v) = 1/2 k=1 |uk vk| (126) First, we prove the following ρdist(gj, p) ηj + 2 min q Qdist EX[d T V (p(X), q(X))]/|Sj| (127) Published as a conference paper at ICLR 2025 ρdist(gj, p) = E[ X k =gj(X) p(X)gj(X)|X Sj] (128) k =gj(X) p(X)gj(X) q(X)gj(X)|X Sj] + E[ X k =gj(X) q(X)gj(X)|X Sj] for all q k |p(X)k q(X)k|X Sj] + E[ X k =gj(X) q(X)gj(X)|X Sj] (130) k |p(X)k q(X)k|I(X Sj)]/|Sj| + ρdist(gj, q) (131) k |p(X)k q(X)k|]/|Sj| + ρdist(gj, q) (132) ρdist(gj, p) 2E[d T V (p(X), q(X))]/|Sj| + ρdist(gj, q) (133) The above holds for all q. Thus ρdist(gj, p) min q Qdist 2E[d T V (p(X), q(X))]/|Sj| + ρdist(gj, q) (134) min q Qdist 2E[d T V (p(X), q(X))]/|Sj| + max q Qdist ρdist(gj, q) (135) 2 min q Qdist E[d T V (p(X), q(X))]/|Sj| + ηj (136) Further using Pinsker s Inequality d T V (u, v) p DKL(u||v)/2 we have E[d T V (p(X), q(X))] = E[d T V (q(X), p(X))] (138) DKL(q(X)||p(X))/2] (139) E[DKL(q(X)||p(X))]/2 (from Jenson s) (140) min q Qdist E[d T V (p(X), q(X))] min q Qdist E[DKL(q(X)||p(X))]/2 (141) min q Qdist E[DKL(q(X)||p(X))]/2 (142) d(p, Qdist)/2 (143) Finally d(p, Qdist) ν2/2 implies minq Qdist E[d T V (p(X), q(X))] ν/2, and thus ρdist(gj, p) ν/|Sj| + ηj. Proposition B.6. Given Qdist as defined in equation 16, for any conditional distribution p, the optimal projection q = arg minq Qdist E[DKL(q(X)||p(X))] is given by q = pλ where pλ(x)y := Zλ(x) exp( X j:gj(x)=y λ)p(x)y (144) Zλ(x) is the normalization constant. The optimal λ is given by the following optimization: λ = min λ 0 j=1 λj s.t. ρdist(gj, pλ) ηj j [m] (145) Proof. The problem is a convex optimization with affine constraints since ρdist(gj, q) = E[P y =gj(X) q(X)y]. To restate it is, min q X K E[DKL(q(X)||p(X))] (146) y =gj(X) q(X)y] ηj j [m] (147) Published as a conference paper at ICLR 2025 In terms of unconstrained variable q X RK, this becomes min q X RK E[ k=1 q(X)k log(q(X)k/p(X)k)] (148) y =gj(X) q(X)y] ηj j [m] (149) y=1 q(x)y = 1 for all x (150) q(x)y [0, 1] for all x, and y [K] (151) Since log is undefined for nonpositive values and assuming p(x)y > 0 for all x, y, we only need to consider q(x)y R+, thus we can ignore the constraint q(x)y 0. q(x)y 1 also follows from the constraint PK y=1 q(x)y = 1. Hence the constraint q(x)y [0, 1] can be ignored. Rephrasing we have min q X RK + E[ k=1 q(X)k log(q(X)k/p(X)k)] (152) s.t. E[q(X)gj(X)] 1 ηj j [m] (153) k=1 q(x)k = 1 for all x (154) Here we subtracted from 1, both sides of the inequality constraint. Now we form the lagrangian L(q, λ, µ) where λ Rm +, a Lagrange multiplier for each inequality constraint, and µ X R, the Lagrange multiplier for equality constraints. L(q, λ, µ) =E[ k=1 q(X)k log(q(X)k/p(X)k)] k=1 q(X)k)] + j=1 λj(1 ηj E[q(X)gj(X)|X Sj]) min q X RK + max λ Rm + ,µ X R L(q, λ, µ) = min q Qdist E[DKL(q(X)||p(X))] Since the optimization is convex, a point (q , λ , µ ) that satisfies the KKT conditions is optimal. KKT stationarity condition implies that q L(q , λ , µ ) = 0 (155) where derivative of an objective L(g) = E[PK k=1 f(g(X, k))] where f : R R is a scalar function, with respect to a function g in X [K] R is defined using methods in calculus of variation as a function in X [K] R g L(g)(x, y) = E[δ(X = x)f (g(X, y))] = d(x)f (g(x, y)) (156) where d is the probability density associated with marginal distribution P on X. Thus we have q L(q , λ , µ )(x, y) = d(x)(1 + log q (x)y/p(x)y µ (x) X j λ j I(y = gj(x))) = 0 (157) y q (x)y = 1, we can eliminate µ (x) to get q (x)y for x in support of P (d(x) > 0) q (x)y = Zλ (x) exp( X j:gj(x)=y λ j)p(x)y (158) Published as a conference paper at ICLR 2025 where Zλ (x) is the normalization constant Zλ (x) = 1/( j:gj(x)=y λ j)p(x)y) (159) Complementary slackness condition on the inequality constraints imply λ i (1 ηi E[q (X)gi(X)|X Si]) = 0 ; i [m] (160) Singling out λ i from expression of q (x)y in equation 158, q (x)gi(x) becomes q (x)gi(x) = exp(λ i + P j =i:gj(x)=gi(x) λ j)p(x)gi(x) exp(λ i + P j =i:gj(x)=gi(x) λ j)p(x)gi(x) + P y =gi(x) exp(P j:gj(x)=y λ j)p(x)y (161) or q (x)gi(x) = exp(λ i )fi(x)/(exp(λ i )fi(x) + 1 fi(x)) (162) where fi(x) = exp( X j =i:gj(x)=gi(x) λ j)p(x)gi(x)/( j =i:gj(x)=y λ j)p(x)y) (163) fi depends on λ j for j = i but dependence is dropped for convenience. The complementary slackness condition in equation 160 now becomes 1 ηi E exp(λ i )fi(X) exp(λ i )fi(X) + 1 fi(X)|X Si Let ℓ(t) = E h t fi(X) t fi(X)+1 fi(X)|X Si i . We have δ δtℓ(t) = E fi(X)(1 fi(X)) (tfi(X) + 1 fi(X))2 |X Si δt2 ℓ(t) = E 2(fi(X))2(1 fi(X)) (tfi(X) + 1 fi(X))3 |X Si (We assume p(x)y (0, 1) and thus fi(x) (0, 1) to get the above inequalities.) Therefore, ℓis a strictly increasing and concave function in t, and thus ℓ(exp(λ)) is strictly increasing in λ. Thus from complementary slackness condition in equation 164, if ℓ(exp(0)) 1 ηi then λ i = 0, else λ i = λ : ℓ(exp(λ)) = 1 ηi. Together this can be written as λ i = min λ 0 λ : ℓ(exp(λ)) 1 ηi (167) λ i = min λ 0 λ : E exp(λ)fi(X) exp(λ)fi(X) + 1 fi(X)|X Si Plugging back fi defined in equation 163, and using definition of pλ, this is λ i = min λ 0 λ : E[pλ(X)gi(X)] 1 ηi (169) where λi = λ, and λj = λ j for j = i. Or equivalently λ i = arg min λi 0 j=1 λj : ρ(gi, pλ) ηi (170) where λj = λ j for j = i. And thus combining for all i we get, λ = arg min λ 0 j=1 λj : ρ(gj, pλ) ηj j [m] (171) Published as a conference paper at ICLR 2025 Proposition B.7. Given the optimization problem λ = arg min λ 0 j=1 λj : ρ(gj, pλ) ηj j [m] (172) The solution found by the following alternating minimization procedure is optimal if the outer loop converges (The inner loop (equation 174) always converges). Start with λ = 0. Then repeat until convergence: 1. Pick j [m], either randomly or in a round robin fashion. 2. Solve for λj = arg min λ λ : ρ(gj, pλ j) ηj (173) where λ j = [λ1 . . . λj 1, 0, λj+1, λm] as follows: Start: λj = 0 Repeat until convergence: λj λj + log(1 + (1 ηj τj(hj))+/τj(hj(1 hj))) τj(h) = E[h(X)|X Sj] (174) where hj(x) = pλ(x)gj(x). Proof. First, we show that the solution to the constrained optimization in equation 173 with respect to λ for a fixed λ j can be found by the iterative algorithm described in equation 174. The LHS of the constraint in equation 173, ρ(gj, pλ j) is a function of the form E[exp(λ)fj(X)/(exp(λ)fj(X) + 1 fj(X))|X Sj] (175) where fj(x) = exp( X k =j:gk(x)=gj(x) λk)p(x)gj(x)/( X k =j:gk(x)=y λk)p(x)y) which is strictly increasing and concave in exp(λ) as proved in equation 165. Suppose E[hj(X; λt, λ j)|X Sj] < 1 ηj. Then λj can also be written as λj = min λ 0 λ : E[hj(X; λt + λ, λ j)|X Sj] 1 ηj hj(x; λt + λ, λ j) = exp(λ)hj(x; λt, λ j)/(exp(λ)hj(x; λt, λ j) + 1 hj(x; λt, λ j)) Letting µ(s) = E[hj(X; log s + λt, λ j)|X Sj]. This is of the form in 175 and is strictly increasing and concave in s as proved in equation 165, and equation 166 respectively. λj is thus given as exp(s), s > 1 s.t. µ(s) = 1 ηj. If a differentiable function µ is strictly increasing and concave we have for any s1, s2 µ(s1) < µ(s1 + (µ(s2) µ(s1))/µ (s1)) < µ(s2) (176) Observe that s = s1 + (µ(s2) µ(s1))/µ (s1) is also one iteration of the Newton-Raphson method to solve for s : µ(s ) = µ(s2) starting with s = s1. Using s1 = 1 and s2 : µ(s2) = 1 ηi, we get µ(1) < µ(1 + (1 ηi µ(1))/µ (1)) < 1 ηi (177) We have µ(1) = E[hj(X; λt, λ j)|X Sj] and µ (1) = E[hj(X; λt, λ j)(1 hj(X; λt, λ j))|X Sj] Thus λt+1 = λt + log 1 + (1 ηi E[hj(X; λt, λ j)|X Sj]) E[hj(X; λt, λ j)(1 hj(X; λt, λ j))|X Sj] Published as a conference paper at ICLR 2025 satisfies E[hj(X; λt, λ j)|X Sj]) < E[hj(X; λt+1, λ j)|X Sj]) < 1 ηi (179) In particular given λt, we apply one step update of Newton-Raphson method on E[hj(X; λt, λ j)|X Sj] to get λt+1 and thus subsequently use hj(; λt+1, λ j) to apply the next update. Convergence of these updates is faster than directly applying Newton s method on the starting function E[hj(X; λ0 = 0, λ j)|X Sj] which is exp(λt+1) = exp(λt) + 1 + (1 ηi E[hj(X; λt, λ j)|X Sj]) E[hj(X; λt, λ j)(1 hj(X; λt, λ j))|X Sj] λt+1 = λt + log 1 + exp( λt)(1 ηi E[hj(X; λt, λ j)|X Sj]) E[hj(X; λt, λ j)(1 hj(X; λt, λ j))|X Sj] This results in a lower update since λt > 0 or exp( λt) < 1. Now we show that if the alternating minimization converges it leads to the optimal solution λ . Suppose the algorithm converges but the solution λ found is not optimal. This implies that λ satisfies all of the constraints, but the KKT complementary slackness condition given in equation 164 is violated. Since E[hj(X; λ, λ j)|X Sj] is a strictly increasing function in λ, and λj solves for λ in equation 173 given λ j, if λj > 0, then the equality must be satisfied, otherwise λj can be reduced (since λj > 0) without violating the inequality constraint. Thus either E[hj(X; λ, λ j)|X Sj] = 1 ηj or λj = 0 which satisfies the KKT complementary slackness condition. Published as a conference paper at ICLR 2025 C COMPACT VERSION OF OUR ALGORITHM Algorithm 1 Learning from weak labeler constraints Input: Unlabeled data {xi}n i=1, weak labelers {gj}m j=1 and the corresponding error bounds {ηj}m j=1. A smooth real-valued function f : X RK. wij := gj(xi) [K] { } // Weak labels cj := P i sij // Coverage for each epoch do zi f(xi) // Logits pi σ(zi) // Probabilities ej (1 ηj P i:wij = pi/cj)+ // Error in constraints dj P i:wij = pi(1 pi)/cj λj log(1 + ej/dj) // One step parallel newton update to find λ (equation 22) (hi)k (zi)k + P j λj I(wij = k) // Logits of projected distribution (equation 22) loss 1 n Pn i=1 ℓce(zi, σ(hi)) // Cross entropy loss f update(f, R(f) + α loss) // Gradient descent update (equation 10) end for Published as a conference paper at ICLR 2025 10 20 40 80 # constraints 0.007 0.011 0.016 0.024 0.004 0.004 0.007 0.011 0.001 0.001 0.001 0.002 0.000 0.000 0.000 0.000 # samples = 100 0.009 0.007 0.009 0.017 0.001 0.000 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 10 20 40 80 # constraints 0.009 0.010 0.023 0.028 0.003 0.005 0.005 0.010 0.001 0.002 0.002 0.003 0.000 0.000 0.000 0.000 # samples = 200 0.007 0.006 0.009 0.015 0.001 0.003 0.000 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 10 20 40 80 # constraints 0.012 0.013 0.032 0.027 0.003 0.006 0.007 0.012 0.002 0.002 0.002 0.003 0.000 0.000 0.000 0.000 # samples = 400 0.005 0.006 0.009 0.015 0.002 0.001 0.003 0.004 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 2 4 8 16 # classes 10 20 40 80 # constraints 0.007 0.019 0.017 0.024 0.004 0.005 0.009 0.010 0.002 0.001 0.002 0.004 0.001 0.000 0.000 0.000 # samples = 800 j (gj , q) j 2 4 8 16 # classes 0.003 0.005 0.010 0.014 0.002 0.002 0.003 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 j (gj , q) j Figure 3: Max error in satisfying convex approximation of constraints (maxj ρdist(gj, q) ηj), (left) vs max error in satisfying classifier constraints (maxj ρ(gj, clf(q)) ηj), (right) averaged over 100 runs of random draws of p and weak labelers, where q = π(p, Qdist), i.e. the projection q is found by projecting p on constraints on distribution as described in section 3.2. D ADDITIONAL RESULTS ON SOLVING THE ILP D.1 SOLVING THE ILP THROUGH LP For the particular ILP in equation 13, we have empirically found that the optimal solution of the relaxed LP is mostly integral, and approximating the non-integer by its integral solution almost always satisfies the constraints (the error in constraint satisfaction ρclf(gj, clf(q)) ηj). In Figure 4 and 5, we show the performance of solving the ILP after LP relaxation. We used scipy.linprog in Python to solve the LP. In particular, Figure 4 shows that the constraints are satisfied well (the error ρclf(gj, clf(q)) ηj is small), and 5 shows that only a small fraction of the variables (one variable for each unlabeled data x) are non-integral. Published as a conference paper at ICLR 2025 10 20 40 80 # constraints 0.004 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 100 0.002 0.002 0.002 0.002 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 200 2 4 8 16 # classes 10 20 40 80 # constraints 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 400 2 4 8 16 # classes 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 800 Figure 4: Max error in satisfaction of classifier constraints (maxj ρ(gj, clf(q)) ηj), averaged over 100 runs of random draws of p and weak labelers, where q = π(p, Qclf), i.e. the projection q is found by projecting p on the constraints on classifier through LP relaxation of the ILP described in 3.5. 10 20 40 80 # constraints 0.010 0.010 0.004 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 100 0.007 0.011 0.005 0.006 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 200 2 4 8 16 # classes 10 20 40 80 # constraints 0.005 0.006 0.005 0.004 0.003 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 400 2 4 8 16 # classes 0.003 0.004 0.003 0.002 0.003 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 # samples = 800 Figure 5: Fraction of non-integral points after solving the LP relaxed version of ILP averaged over 100 runs of random draws of p and weak labelers. Published as a conference paper at ICLR 2025 2 4 8 16 32 64 # of classes Max Ratio of KL Divergence Epsilon Values = 0.1 = 0.2 = 0.3 = 0.4 Methods Approximation Gradient Descent Figure 6: Max of DKL(h||p)/DKL(h ||p) over 1000 uniformly random draws of p. h denotes the solution of 182 found by approximation (equation 184) or gradient descent (20 steps after initial approximation) and h denotes the optimal solution as found by cvxpy.CLARABEL convex solver. D.2 ESTIMATION OF THE COST VECTOR Proj KL(p(x); K y,ϵ) and q (x) from equation 14, and equation 15 requires solving an optimization problem. We expand the DKL term to have i=k hi log(hk/p(x)k) s.t. hy (1 + ϵ)hk, k = y (182) which can be solved by a black box convex solver. However, it can also be reparameterized in λ RK 1 + to get an unconstrained optimization as, hk = exp( λk)/Z for k = y, hy = (1 + ϵ)/Z; Z = X The objective is no longer convex in λ but projected gradient descent (projection to RK 1 + ) can still be used to find a local minima. In particular, we find the approximation λi = log(max j p(x)j/p(x)i) (184) yields h for which DKL(h||p(x)) is empirically found to be within a small constant factor from the optimal. This can further be used as a starting point for gradient descent, which then results in a solution close to optimal. In Figure 6, we show empirically that our approximation is within a constant factor from the optimal as found by cvxpy.CLARABEL convex solver in Python, and a few iterations of projected gradient descent after this approximation find a solution very close to the optimal. Published as a conference paper at ICLR 2025 E CONVEX APPROXIMATION OF THE CONSTRAINT ON CLASSIFIER WITH THE CONSTRAINT ON DISTRIBUTION We remarked that the constraint sets Qj,dist that impose constraints on distribution, i.e. Qj,dist = {q X K | ρdist(gj, q) ηj} (185) can be considered as a convex approximation to the constraint sets Qj,clf that impose constraints on classifiers, i.e. Qj,clf = {q X K | ρclf(gj, clf(q)) ηj} (186) First, we note that for distributions that are very confident on one label, i.e. for any x, there is a y, such that p(x)y > 1 ϵ for a small ϵ, we can quantify the extent of approximation. Lemma E.1. If for any x there is a y, such that p(x)y > 1 ϵ, then ρclf(gj, clf(p)) ρdist(gj, p)/(1 ϵ) and ρdist(gj, p) ρclf(gj, clf(p))(1 ϵ) + ϵ Thus any p that satisfies ρclf(gj, clf(p)) (ηj ϵ)/(1 ϵ) lies in Qj,dist and any p Qj,dist satisfies ρclf(gj, clf(p)) ηj/(1 ϵ). ρdist(g, p) = E[ X k =g(X) p(X)k|X S, clf(p)(X) = g(X)] Pr(clf(p)(X) = g(X)|X S) k =g(X) p(X)k|X S, clf(p)(X) = g(X)] Pr(clf(p)(X) = g(X)|X S) ϵ(1 ρclf(g, clf(p))) + ρclf(g, clf(p)) ρdist(g, p) ρclf(g, clf(p))(1 ϵ) + ϵ ρdist(g, p) 0 + (1 ϵ)ρclf(g, clf(p)) ρclf(g, clf(p)) ρdist(g, p)/(1 ϵ) We dropped subscript j for convenience. Substituting ρclf(gj, clf(p)) (ηj ϵ)/(1 ϵ) in the bound for ρclf, we get ρdist(gj, p) ηj. Similarly, substituting ρdist(gj, p) ηj in the bound for ρclf, we get ρclf(gj, clf(p)) ηj/(1 ϵ). The above lemma suggests that if the true distribution makes confident predictions, (which is often the case for many real-world classification setting), then imposing a constraint on the distribution is close to imposing a constraint on classifier. However, this is only a sufficient condition and the convex approximation can be good without such a condition as well. In Figure, 3 we show the maximum error in satisfaction of any classifier constraint maxj ρclf(gj, clf(q)) ηj for p sampled uniformly at random when we use the method for projecting on constraints on distribution, i.e. q = π(p, Qdist). The data for weak labelers was generated randomly by first drawing Sj such that |Sj| N(0.6, 0.3) and error bound ηj N(0.3, 0.3) and then choosing weak labels uniformly at random such that their expected error in Sj is ηj. To estimate the projection π(p, Qdist), 20 steps of alternating minimization in a round-robin fashion (iterating over all constraints one by one) is performed to find λ (proposition B.7), where each step for each constraint i is optimized for 10 steps to find λi given current the iterate λ i. F ESTIMATING ACCURACY BOUND FROM THE VALIDATION SET Our algorithm needs some estimate of the error bounds ηj. Often they are not given but we have a small validation set with true labels, which can be leveraged to get bounds. One can be tempted to estimate ηj for each labeler gj separately from available labels in its coverage region Sj. This approach is problematic since many weak labelers have a small coverage set Sj such Published as a conference paper at ICLR 2025 that it is unlikely to have coverage in the validation set. Indeed if the validation set was large enough to have good coverage for all weak labelers, supervised learning may perform well enough rendering any additional information provided by weak labeler constraints uninformative. We consider two ways to alleviate this problem of low coverage. First, we consider a prior distribution on weak labeler errors ηj and then compute a Bayesian posterior from the labeled set (and use its mean as the estimate of the bound ηj). When there are no labels in the coverage, the posterior remains the same as the prior. One example of a prior is a Beta prior β(α, β) under which observing k incorrect out of n samples induces the posterior with mean (k + α)/(n + α + β). This can also be viewed as domain experts providing a general prior belief about errors instead of individual error estimates/ bounds for each weak labeler. Second, we consider a shared parameterization of errors as ηj = hϕ(gj, Sj) and then treat ϕ as a hyperparameter to be tuned on the validation set. A simple instance of this shared parameterization is a common bound ηj = ϕ , i.e. hϕ( ) = ϕ. This can also be viewed as domain experts providing relative information about errors, such as the average errors of different weak labelers are similar/correlated. Published as a conference paper at ICLR 2025 G ADDITIONAL EXPERIMENT DETAILS G.1 DATASET STATISTICS Dataset Num classes Num labelers Mean Accuracy Mean Coverage Youtube 2 10 0.83 0.14 0.17 0.09 IMDB 2 5 0.71 0.15 0.24 0.31 Bioresponse 2 20 0.64 0.09 0.17 0.20 Yelp 2 8 0.74 0.09 0.18 0.20 CDR 2 33 0.72 0.13 0.06 0.12 Chemprot 10 26 0.47 0.17 0.06 0.07 Semeval 9 164 0.77 0.40 0.01 0.03 Trec 6 68 0.73 0.37 0.03 0.08 Table 2: Dataset statistics includes the number of classes, the number of weak labelers, and the mean accuracy and coverage of the weak labelers. G.2 ERROR IN SATISFACTION OF CONSTRAINTS Table G.2 shows the errors in satisfaction of classifier constraints for our trained classifier f clf, when projecting on Qclf (row Ourclf(T) in table 1) as j clf, and f dist when projecting on Qdist (row Our(T) in table 1) as j dist. j clf = (ρclf(gj, f clf)) ηj)+ (187) j dist = (ρclf(gj, f dist)) ηj)+ (188) j=1 wj j (189) where wj = |Sj|/ j=1 |Sj| (190) For most datasets, we see that almost all constraints are satisfied. Despite projecting on Qdist, Bio CDR Chem IMDB Semeval Trec Yelp Youtube # Classes 2 2 10 2 9 6 2 2 # Labelers 20 33 26 5 164 68 8 10 meanj j dist 0.00.0 0.00.0 0.00.0 0.020.02 0.00.0 0.00.0 0.00.0 0.00.0 meanj j clf 0.00.0 0.00.0 0.020.01 0.050.01 0.00.0 0.020.01 0.010.0 0.010.01 maxj j dist 0.080.15 0.070.07 0.190.29 0.110.05 0.00.01 0.120.06 0.030.01 0.010.01 maxj j clf 0.030.01 0.050.05 0.280.22 0.120.03 0.050.03 0.550.29 0.060.01 0.040.03 Table 3: Mean and Max error in satisfaction of constraints j dist which measures error in satisfying classifier constraint is small, indicating that projecting on distribution works well empirically for satisfying constraints on classifier. G.3 EXPERIMENT DETAILS OF SOLVING THE ILP In table 1, we also reported our results for the method of projecting on the constraint on classifiers through solving the ILP in 3.5, as Ourclf(T). We did not have access to the true conditional distribution, and so we could not estimate accuracy bounds with respect to the target Bayes optimal classifier equation 1. Instead, we use the same bounds as estimated from training labels by interpreting them as noisy bounds with respect to the target classifier. Published as a conference paper at ICLR 2025 We use the python function scipy.linprog to solve the ILP and estimate the projection. ϵ in 12 was set to 0.2. The cost vector Proj KL(p(X); K y,ϵ) of the ILP is estimated as described in D.2, in particular we use the approximation λk = log(maxj p(x)j/p(x)k) for λk in equation 183. To limit the number of constraints in the linear program, we use stochastic gradient descent with a batch size of 1000. We take a single gradient step in each iteration of the alternating minimization in 3.2. Published as a conference paper at ICLR 2025 H ERROR GUARANTEE INSIDE THE COVERAGE OF WEAK LABELERS Theorem H.1. Let H1, . . . , Hk be constraints of the form Hj = {h : X [K] | ρclf(gj, h) ηj}, each with different constant weak labels, gj(X) = j 1 for all j = 1, . . . , K, then for any f T j Hj, we have j=1 (ηj + err Sj(gj))Pr(Sj) σ {1,...,k}, |σ| 2 (2|σ| 3)Pr(Bσ) Pr(S) . (191) when S = Sk j=1 Sj, Bσ = (T i σ Si) (T i {1,...,k}\σ Sc i ) are minterms. Proof. (Theorem 4.5 Let Bσ = (T i {1,...,k}\σ Sc i ) be minterms. For simplicity, we denote Bj = B{j} and write Pr(f(X) = f (X); B) = Pr(X {x X | f(x) = f (x)} B). Since the weak labelers gj(x) are constant and different for each gj, without loss of generality, we assume that gj(x) = j 1 for any x Sj and for all j = 1, . . . , k. Intuitively, Bj are regions in Sj with no other conflicting weak labels. We observe that err S(f) Pr(X S) (192) = Pr(f(X) = f (X); S) (193) j=1 Pr(f(X) = f (X); Bj) + Pr(f(X) = f (X); S j=1 Bj) (Bi Bj = , i = j) j=1 Pr(f(X) = j 1; Bj) + Pr(f (X) = j 1; Bj) + Pr(X S j=1 Bj) ( inequality j Hj, for j = 1, . . . , k, we have Pr(f(X) = j 1; Sj) ηj Pr(X Sj) (196) Pr(f(X) = j 1; Bj) + Pr(f(X) = j 1; Sj Bj) ηj Pr(X Sj) (197) Pr(f(X) = j 1; Bj) ηj Pr(X Sj) Pr(f(X) = j 1; Sj Bj). (198) Similarly, since f T j Hj, for j = 1, . . . , k, we also have Pr(f (X) = j 1; Bj) err Sj(gj) Pr(Sj) Pr(f (X) = j 1; Sj Bj). (199) Substitute these inequalities to the above, we have err S(f) Pr(X S) j=1 (ηj + err Sj(gj)) Pr(X Sj) Pr(f(X) = j 1; Sj Bj) (200) Pr(f (X) = j 1; Sj Bj) + Pr(X S j=1 Bj) (201) The region Sj Bj is the area in Sj with at least one other conflicting region, we can write this as a combination of minterms that represent each weak label in each region, σ {1,...,k}, j σ,|σ| 2 Published as a conference paper at ICLR 2025 j=1 Pr(f(X) = j 1; Sj Bj) (203) j=1 Pr(X Sj Bj) Pr(f(X) = j 1; Sj Bj) (204) j=1 Pr(X Sj Bj) σ {1,...,k}, j σ,|σ| 2 Pr(f(X) = j 1; Bσ) (from equation 202) (205) j=1 Pr(X Sj Bj) X σ {1,...,k}, |σ| 2 j σ Pr(f(X) = j 1; Bσ) (206) j=1 Pr(X Sj Bj) X σ {1,...,k}, |σ| 2 Pr(f(X) + 1 σ; Bσ) (label starts from 0) (207) j=1 Pr(X Sj Bj) X σ {1,...,k}, |σ| 2 Pr(X Bσ) (208) The second to last step holds from the fact that for each σ, we will see Bσ, |σ| times for each element j σ in terms of Pr(f(X) = j 1; Bσ). The same argument also holds for f . Substitute this back to the equation equation 200 err S(f) Pr(X S) j=1 (ηj + err Sj(gj)) Pr(X Sj) 2 j=1 Pr(X Sj Bj) (209) σ {1,...,k},|σ| 2 Pr(X Bσ) + Pr(X S j=1 Bj). (210) Finally, we can write each term as a combination of Bσ, j=1 Bj) = X σ {1,...,k}, |σ| 2 Pr(X Bσ), (211) j=1 Pr(X Sj Bj) = X σ {1,...,k}, |σ| 2 |σ| Pr(X Bσ) (212) Substitute in, we have the result, err S(f) Pr(X S) j=1 (ηj + err Sj(gj)) Pr(X Sj) X σ {1,...,k}, |σ| 2 (2|σ| 3) Pr(X Bσ). or in short, we write j=1 (ηj + err Sj(gj))Pr(Sj) σ {1,...,k}, |σ| 2 (2|σ| 3)Pr(Bσ) Pr(S) . (214) Published as a conference paper at ICLR 2025 I EXTENDING THE ERROR GUARANTEE TO THE AREA OUTSIDE OF THE WEAK LABELERS COVERAGE WITH A SMOOTHNESS PROPERTY To prove Theorem 4.6, we rely on the following observation. Lemma I.1. Let f be an L-Lipschitz function w.r.t. a metric d. For any distribution P, Q we have EX P [f(X)] EY Q[f(Y )] Wd(P, Q) (215) where Wd(P, Q) is a Wasserstein distance between P, Q where the cost function is given by d. Proof. For any distribution P, Q and a transport plan T : X X between P, Q, we have EX P [f(X)] EY Q[f(Y )] = Z f(x)p(x)dx Z f(y)q(y)dy (216) = Z Z f(x)T(x, y)dydx Z Z f(y)T(x, y)dxdy (217) = Z Z |f(x) f(y)|T(x, y)dydx L Z Z d(x, y)T(x, y)dydx. (218) When T is the optimal transport plan, the right-hand side term is the smallest, which is the definition of the Wasserstein distance Wd(P, Q). The following Theorem is a generalization of Theorem 4.6. Theorem I.2. Let P be a joint distribution over X Y such that Pr(Y = y | X = x) is L-Lipschitz w.r.t. a metric d where for all k = 1, . . . , K and x1, x2 X, | Pr(Y = k | X = x1) Pr(Y = k | X = x2)| Ld(x1, x2), (219) For any set S and f we write an error of f with respect to P as f err S(f) = Pr(f(X) = Y | X S), then for any subsets C, D X, we have g err D(f) g err C(f)(1 + tf(C, D)) + Luf(C, D) (220) tf(C, D) = sup k |Pr C(f(X) = k | X C) Pr(f(X) = k | X D) Pr(f(X) = k | X C) | (221) and uf(C, D) = X j Pr D (f(X) = k)Wd(PX|C {x|f(x)=k}, PX|D {x|f(x)=k}), (222) Wd is a Wasserstien distance and we denote PX|A for a distribution PX condition on X A. Proof. First, we write an error term as a combination of errors for each prediction of f(x), f err D(f) = Pr(f(X) = Y | X D) (223) k=1 Pr(f(X) = Y | f(X) = k, X D) Pr(f(X) = k | X D) (224) k=1 Pr(Y = k | f(X) = k, X D) Pr(f(X) = k | X D). (225) We observe that | Pr(Y = k | f(X) = k, X D) Pr(Y = k | f(X) = k, X C)| (226) = | Pr(Y = k | f(X) = k, X D) Pr(Y = k | f(X) = k, X C)| (227) = |EX PX|D {x|f(x)=k}[Pr(Y = k | X)] EX PX|C {x|f(x)=k}[Pr(Y = k | X)]| (228) LWd(PX|D {x|f(x)=k}, PX|C {x|f(x)=k}) (229) Published as a conference paper at ICLR 2025 In the second to last line, we can think of Pr(Y = k|X) as a function of X for which we can then apply Lemma I.1 to achieve the last line. For compactness of our notation, we denote Pr C( ) = Pr X( | X C). Substitute this back to the equation above, we have k=1 Pr C (Y = k | f(X) = k) Pr D (f(X) = k) (230) k=1 |Pr D (Y = k | f(X) = k) Pr C (Y = k | f(X) = k)| Pr D (f(X) = k) (231) k=1 Pr C (Y = k | f(X) = k) Pr D (f(X) = k) (232) k=1 LWd(PX|D {x|f(x)=k}, PX|C {x|f(X)=k}) Pr D (f(x) = k) (233) k=1 Pr C (Y = k | f(X) = k) Pr D (f(X) = k) + Luf(C, D). (234) Finally, we also observe that k=1 Pr C (Y = k | f(X) = k) Pr D (f(X) = j) (235) k=1 Pr C (Y = k | f(X) = k) Pr C (f(X) = k)(1 + Pr D(f(X) = k) Pr C(f(X) = k) Pr C(f(X) = k) ) (236) k=1 Pr C (Y = k | f(X) = k) Pr C (f(X) = k)|1 + Pr D(f(X) = k) Pr C(f(X) = k) Pr C(f(X) = k) | (237) k=1 Pr C (Y = k | f(X) = k) Pr C (f(X) = k))(1 + sup k |Pr D(f(X) = k) Pr C(f(X) = k) Pr C(f(X) = k) |) =f err C(f)(1 + tf(C, D)) (239) Combine with the above, we have g err D(f) g err C(f)(1 + tf(C, D)) + Luf(C, D) (240)