Published as a conference paper at ICLR 2023

LABEL PROPAGATION WITH WEAK SUPERVISION

Rattana Pukdee*, Dylan Sam*, Maria-Florina Balcan, Pradeep Ravikumar
Machine Learning Department
Carnegie Mellon University
Pittsburgh, USA
{rpukdee, dylansam}@cs.cmu.edu
*Equal contribution

ABSTRACT

Semi-supervised learning and weakly supervised learning are important paradigms that aim to reduce the growing demand for labeled data in current machine learning applications. In this paper, we introduce a novel analysis of the classical label propagation algorithm (LPA) (Zhu & Ghahramani, 2002) that moreover takes advantage of useful prior information, specifically probabilistic hypothesized labels on the unlabeled data. We provide an error bound that exploits both the local geometric properties of the underlying graph and the quality of the prior information. We also propose a framework to incorporate multiple sources of noisy information. In particular, we consider the setting of weak supervision, where our sources of information are weak labelers. We demonstrate the ability of our approach on multiple benchmark weakly supervised classification tasks, showing improvements upon existing semi-supervised and weakly supervised methods.

1 INTRODUCTION

High-dimensional machine learning models require large labeled datasets for good performance and generalization. In the paradigm of semi-supervised learning, we look to overcome the bottleneck of labeled data by leveraging large amounts of unlabeled data and assumptions on how the target predictor behaves over the unlabeled samples. In this work, we focus on the classical semi-supervised approach of label propagation (LPA) (Zhu & Ghahramani, 2002; Zhou et al., 2003). This method propagates labels from labeled to unlabeled samples, under the assumption that the target predictor is smooth with respect to a graph over the samples (frequently defined by a Euclidean distance threshold or nearest neighbors). However, in practice, the graph may have to be highly disconnected in order to satisfy this strong assumption. In these cases, LPA performs well locally on regions connected to labeled points, but has low overall coverage, as it cannot propagate to points beyond these connected regions.

In practice, we also have additional side information beyond such smoothness of the target predictor. One concrete example of side information comes from the field of weakly supervised learning (WSL) (Ratner et al., 2016; 2017), which considers learning predictors from domain knowledge that takes the form of hand-engineered weak labelers. These weak labelers are heuristics that provide multiple weak labels per unlabeled sample, and the focus in WSL is to aggregate these weak labels to produce noisy pseudolabels for each unlabeled sample. In practice, weak labelers are typically not designed to be smooth with respect to a graph, even though the underlying target predictor might be. For example, weak labelers are commonly defined as hard, binary predictions, with an ability to abstain from predicting. We thus see that LPA and WSL have complementary sources of information, as smoothing via LPA can improve the quality of weak labelers. By encouraging smoothness, predictions near multiple abstentions can be made more uncertain, and abstentions can be converted into predictions by confident nearby predictions.

In this paper, we first bolster the theoretical foundations of LPA in the presence of side information.
While LPA has a strong theoretical motivation of leveraging smoothness of the target predictor, there is limited theory on how accurate the propagated labels actually are. As a key contribution of this paper, we provide a fine-grained theory of LPA when used with any general prior on the target classes of the unlabeled samples. We provide a novel error bound for LPA, which depends on key local geometric properties of the graph, such as the underlying smoothness of the target predictor over the graph and the flow of edges from labeled points, as well as the accuracy of our prior. Our bound provides an intuition as to when LPA should prioritize propagating label information and when it should prioritize using prior information. We provide a comparison of our error bound to an existing spectral bound (Belkin & Niyogi, 2004) and demonstrate that our bound is preferable in some examples.

Next, we propose a framework for incorporating multiple sources of noisy information into LPA by extending a framework from Zhu et al. (2003). We construct additional dongle nodes in the graph that correspond to individual noisy labels. We connect these additional nodes to the unlabeled points that receive noisy predictions and perform label propagation on this new graph as usual. We study multiple different techniques for determining the weights on these additional edges.

Finally, we focus on the specific case where our side information comes from WSL. We provide experimental results on standard weakly supervised benchmark tasks (Zhang et al., 2021) to support our theoretical claims and to compare our methods to standard LPA, other semi-supervised methods, and existing weakly supervised baselines. Our experiments demonstrate that incorporating smoothness via LPA in the standard weakly supervised pipeline leads to better performance, outperforming many existing WSL algorithms. This supports that there are significant benefits to combining LPA and WSL, and we believe that this intersection is a fertile ground for future research.

1.1 RELATED WORK

Label propagation. Many papers have studied LPA from a theoretical standpoint. LPA has various connections to random walks and spectral clustering (Zhu et al., 2003), manifold learning (Belkin & Niyogi, 2004; Belkin et al., 2006), network generative models (Yamaguchi & Hayashi, 2017), and graph conductance (Talukdar & Cohen, 2014). Another line of research in LPA proposes using prior information at the initialization of LPA (Yamaguchi et al., 2016; Zhou et al., 2018), with applications in image segmentation (Vernaza & Chandraker, 2017), distant supervision (Bing et al., 2015), and domain adaptation (Cai et al., 2021; Wei et al., 2020). Finally, as the graph has a large impact on the performance of LPA, another line of work studies how to optimize the construction of the graph with linear methods (Wang & Zhang, 2007), manifold-based methods (Karasuyama & Mamitsuka, 2013), or deep learning based methods (Liu et al., 2018; 2019).

Weakly supervised learning. The field of (programmatic) weakly supervised learning provides a framework for creating and combining hand-engineered weak labelers (Ratner et al., 2016; 2017; 2019; Fu et al., 2020) to pseudolabel unlabeled data and train a downstream model. Recent advances in weakly supervised learning extend the setting to include a small set of labeled data.
One recent line of work has considered constraining the space of possible pseudolabels via weak labeler accuracies (Arachie & Huang, 2019; Mazzetto et al., 2021a;b; Arachie & Huang, 2021; 2022). Other works improve the aggregation scheme (Xu et al., 2021) or the weak labelers themselves (Awasthi et al., 2020). We note that only one method incorporates any notion of smoothness into the weakly supervised pipeline (Chen et al., 2022). This work leverages the smoothness of pretrained embeddings in clustering. While clustering and LPA have similar intuitions, they result in fundamentally different notions of smoothness. We also remark that this paper does not consider the semi-supervised setting.

Semi-supervised learning. Many other methods in semi-supervised learning look to induce smoothness in a learnt model. These include consistency regularization (Bachman et al., 2014; Sajjadi et al., 2016; Samuli & Timo, 2017; Sohn et al., 2020) and co-training (Blum & Mitchell, 1998; Balcan et al., 2004; Han et al., 2018). In addition, Graph Neural Networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Gilmer et al., 2017; Scarselli et al., 2008; Gori et al., 2005; Henaff et al., 2015) are a class of deep learning based methods that also operate over graphs. Some recent works (Huang et al., 2020; Wang & Leskovec, 2020; Dong et al., 2021) have made connections between graph neural networks and LPA. While all these methods focus on a similar goal of learning a smooth function, they do not address the weakly supervised setting.

2 PRELIMINARIES

We consider a binary classification setting where we want to learn a classifier $f : \mathcal{X} \to \{0, 1\}$. We observe a small set of labeled data $L = \{(x_i, y_i)\}_{i=1}^{n}$ and a much larger set of unlabeled data $U = \{x_j\}_{j=n+1}^{n+m}$. LPA relies on the assumption that nearby data points have similar labels. This is expressed in terms of smoothness with respect to an undirected graph $G = (V, E)$, with $|V|$ nodes representing each point $x \in L \cup U$, and with an adjacency matrix $W = (w_{ij})$. LPA then leverages the assumption that adjacent points in this graph have similar labels by propagating label information from $L$ to $U$. Specifically, it learns $f : \mathcal{X} \to \mathbb{R}$ by solving the following optimization problem:
$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2} \sum_{i,j=1}^{n+m} w_{ij}(f_i - f_j)^2 \quad \text{s.t.} \quad f_i = y_i \;\; \text{for } i \le n,$$
where $f \in \mathbb{R}^{n+m}$ is the prediction vector and, abusing notation, $f_i = f(x_i)$. The method generalizes to the multi-class setting by replacing $y_i$ with a one-hot-encoding vector and predicting a score vector at each node. Zhu et al. (2003) provide a quick iterative method to solve this optimization problem.

Figure 1: A diagram of edge flow between neighborhoods of $L$: (a) diagram of edge flows, where the color of each edge indicates which flow (In, Between, Out) it contributes to; (b), (c) examples of graphs with different structure (In < Out and In > Out), where colored points represent labeled points.

3 LABEL PROPAGATION WITH PRIOR INFORMATION

We analyze LPA with initial noisy predictions $h : \mathcal{X} \to [0, 1]$, by solving the following objective:
$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2} \Big( \sum_{i,j=1}^{n+m} w_{ij}(f_i - f_j)^2 + \mu \sum_{i=1}^{n+m} (f_i - h(x_i))^2 \Big) \quad \text{s.t.} \quad f_i = y_i \;\; \text{for } i \le n, \tag{1}$$
where $\mu \ge 0$ determines how much the solution is regularized to be close to $h$. In the standard LPA, we have no prior information on the unlabeled points, which can be seen as the case when $h = 0.5$ and $\mu \to 0$. In our theory, $h$ can be any general prior.
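As an unofficial, concrete illustration of how Equation 1 can be solved, the following is a minimal iterative sketch based on the fixed-point characterization in Lemma 1 (Appendix B). It assumes a dense symmetric NumPy adjacency matrix `W` over all n+m points, a prior vector `h`, and labels `y` given at the indices `labeled_idx`; the function and variable names are ours, not the paper's released implementation.

```python
import numpy as np

def label_prop_with_prior(W, y, labeled_idx, h, mu=1.0, n_iters=200, tol=1e-6):
    """Iteratively solve Eq. (1): clamp labeled points and repeatedly set each
    unlabeled f_i to a weighted average of its neighbors and the prior h_i.
    Assumes mu > 0 or that every unlabeled node has at least one neighbor."""
    f = h.astype(float).copy()           # initialize at the prior
    f[labeled_idx] = y[labeled_idx]      # hard constraint f_i = y_i on L
    deg = W.sum(axis=1)                  # sum_j w_ij for each node
    for _ in range(n_iters):
        # fixed-point update implied by stationarity of Eq. (1) (Lemma 1)
        f_new = (W @ f + mu * h) / (deg + mu)
        f_new[labeled_idx] = y[labeled_idx]
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f
```

Setting `mu=0` and `h=0.5` everywhere recovers the standard LPA iteration as a special case.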
3.1 ERROR BOUND OF LPA WITH PRIOR INFORMATION

Similar to the standard LPA, there exists a closed form optimal solution of Equation 1, which is discussed in Appendix A. We know that, for the optimal solution of Equation 1, we can bound the error of a point $i$ (that is, $|f^*_i - y_i|$) by the error of its neighbors and by terms corresponding to the smoothness of the true labels and the accuracy of the prior information; we formally state this in Appendix B (Lemma 2). Because the error on labeled points is zero, we can bound the error in terms of the distance of a point to the nearest labeled point. For a set of labeled data $L$, let $N(L)$ be the set of reachable points, i.e., points with at least one path from a point in $L$. Define the set of neighbors $k$ hops away from $L$ as $N_k(L)$ (i.e., the set of points whose shortest path to a point in $L$ has length $k$). Let $l$ be the number of hops required to cover $N(L)$, so that $N(L) = \bigcup_{k=1}^{l} N_k(L)$. For simplicity, we denote $N_k(L)$ by $N_k$ and $L$ by $N_0$.

We now define terms that are fundamental to our error bound. First, we introduce notions of In-flow, Between-flow, and Out-flow, which represent the weight of edges that flow into, between, and out of $N_k(L)$.

Definition 1. For a graph $G$ with an adjacency matrix $W = (w_{ij})$ and a set of $k$-hop neighbors $N_k$, we define the In-flow, Between-flow, and Out-flow of $N_k$ as
$$C_{\mathrm{in}}(k) = \sum_{i \in N_k,\, j \in N_{k-1}} w_{ij}, \qquad C_{\mathrm{bet}}(k) = \sum_{i \in N_k,\, j \in N_k} w_{ij}, \qquad C_{\mathrm{out}}(k) = \sum_{i \in N_k,\, j \in N_{k+1}} w_{ij}.$$

These terms are related to the notion of conductance, which measures the fraction of outgoing edges from any subset of nodes. We can write the Dirichlet conductance (HaoChen et al., 2021) of a neighborhood $N_k$ as
$$\phi(N_k) = \frac{C_{\mathrm{in}}(k) + C_{\mathrm{out}}(k)}{C_{\mathrm{in}}(k) + C_{\mathrm{bet}}(k) + C_{\mathrm{out}}(k)}.$$

Definition 2. (Ratio between Out-flow and In-flow)
$$\gamma_k = \frac{C_{\mathrm{out}}(k)}{C_{\mathrm{in}}(k) + \mu |N_k|}.$$
$\gamma_k$ is the ratio of the Out-flow to the (regularized) In-flow of a neighborhood (see Figure 1 for graphs with different flows). Next, we define the smoothness of $N_k$, the prior information error, and the average error.

Definition 3. (Smoothness of neighborhood) For $1 \le k \le l$, we define the smoothness of the true labels of points in $N_k$ with respect to the graph as
$$s_k = \sum_{i \in N_k} \sum_{j} w_{ij} |y_j - y_i|.$$

Definition 4. (Prior information error) For $1 \le k \le l$, let the average error of the prior in $N_k$ be
$$\alpha_k = \frac{1}{|N_k|} \sum_{i \in N_k} |h_i - y_i|.$$

Definition 5. (Average error) We define the average error on $N_k$ as
$$E_k = \frac{\sum_{i \in N_k} |f^*_i - y_i|}{|N_k|}.$$

Theorem 1. (Informal version of Theorem 3) Let $f^*$ be the optimal solution of the optimization problem of Equation 1. Under Assumption 1, which assumes that the average error of the points in $N_k$ with Out connections is within a constant factor of the average error in $N_k$ (refer to Appendix B), the error of $f^*$ in each neighborhood $N_k$ satisfies
$$E_k \le O\Big(\sum_{i=1}^{k} d_i\Big), \qquad d_k = \sum_{i=k}^{l} c_i \prod_{j=k}^{i-1} \gamma_j, \qquad c_k = \frac{s_k + \mu |N_k| \alpha_k}{C_{\mathrm{in}}(k) + \mu |N_k|}.$$

Proof. (Sketch) The key idea of our proof is to upper bound each $E_i$ for $i \in \{1, \ldots, l\}$ by exploiting the insight that we can bound the average error of a set $N_i$ (points that are $i$ hops away from labeled points) with the average errors of its neighbors $N_{i-1}$ and $N_{i+1}$ by using Lemma 2. We first bound $E_1$ with $E_0 = 0$ and $E_2$, then we bound $E_2$ with $E_1$ and $E_3$, and so on. See Appendix B for the full version of our proof.

Figure 2: Examples of graphs $G_1, G_2, G_3$ (left, mid, right) used to compare our bound to existing bounds. The background color represents the true label class, and colored points represent labeled points.

$c_k$ is a combination of the smoothness $s_k$ and the prior accuracy $\alpha_k$, and $\mu$ controls the trade-off between using information from the graph or the initialization; the sketch below illustrates how these quantities can be computed on a given graph.
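The following is a minimal, illustrative sketch (ours, not the authors' code) that computes the hop neighborhoods $N_k$, the flows $C_{\mathrm{in}}, C_{\mathrm{out}}$, the ratios $\gamma_k$, the terms $c_k$, and the bound quantities $d_k$ from Theorem 1, assuming a dense adjacency matrix and access to the true labels $y$ (so that $s_k$ and $\alpha_k$ can be evaluated on a synthetic example).

```python
import numpy as np
from collections import deque

def hop_neighborhoods(W, labeled_idx):
    """BFS from the labeled set L: dist[i] = k means i is in N_k
    (shortest path from i to L has length k); dist[i] = -1 if unreachable."""
    dist = np.full(W.shape[0], -1)
    dist[labeled_idx] = 0
    queue = deque(list(labeled_idx))
    while queue:
        i = queue.popleft()
        for j in np.nonzero(W[i] > 0)[0]:
            if dist[j] == -1:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def theorem1_quantities(W, y, h, labeled_idx, mu):
    """Compute gamma_k, c_k (Definitions 1-4), d_k, and the Theorem 1 style bound
    sum_{i <= k} d_i for every hop level k (up to the hidden constant)."""
    dist = hop_neighborhoods(W, labeled_idx)
    l = int(dist.max())
    gamma, c = {}, {}
    for k in range(1, l + 1):
        Nk = np.nonzero(dist == k)[0]
        C_in = W[np.ix_(Nk, np.nonzero(dist == k - 1)[0])].sum()
        C_out = W[np.ix_(Nk, np.nonzero(dist == k + 1)[0])].sum()
        s_k = sum(np.sum(W[i] * np.abs(y - y[i])) for i in Nk)   # Definition 3
        alpha_k = np.mean(np.abs(h[Nk] - y[Nk]))                 # Definition 4
        denom = C_in + mu * len(Nk)                              # assumes mu > 0 or C_in > 0
        gamma[k], c[k] = C_out / denom, (s_k + mu * len(Nk) * alpha_k) / denom
    d = {}
    for k in range(1, l + 1):
        total, prod = 0.0, 1.0
        for i in range(k, l + 1):        # d_k = sum_i c_i * prod_{j=k}^{i-1} gamma_j
            total += c[i] * prod
            prod *= gamma[i]
        d[k] = total
    return {k: sum(d[i] for i in range(1, k + 1)) for k in range(1, l + 1)}
```

On real data, $s_k$ and $\alpha_k$ are not observable and would have to be approximated from labeled data, as noted in Section 6.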
When $\mu = 0$, we recover the standard LPA without any prior. On the other hand, $\mu \to \infty$ is equivalent to only using the initial predictions. $d_k$ is a linear combination of the $c_i$ for $k \le i \le l$, where the coefficient of each $c_i$ is given by $\prod_{j=k}^{i-1} \gamma_j$, representing the influence from $N_i$. When $\gamma_j < 1$ (In > Out), the influence is exponentially small, while when $\gamma_j > 1$ (Out > In) the influence can be exponentially large. This aligns with our intuition that when we have more "In" than "Out", we will have a better guarantee. We remark that if $c_i = 0$, then regardless of the product $\prod_{j=k}^{i-1} \gamma_j$, $c_i$ makes no contribution to the bound.

To tighten this upper bound, we have to reduce both $c_k$ and $\gamma_k$. In doing so, the value of $\mu$ is important; a larger value of $\mu$ reduces both $c_k$ and $\gamma_k$ by increasing their denominators. Thus, given similar levels of smoothness and prior accuracy ($\frac{s_k}{C_{\mathrm{in}}(k)} \approx \alpha_k$), it is better to use a larger value of $\mu$; that is, we should rely more on the prior information.

The number of hops $k$ from the labeled points $L$ also plays a key role in the bound. The upper bound on $E_k$ is given by a linear combination of $k$ terms, so points that are closer to $L$ will have a smaller $k$ and a better guarantee. This encourages us to have a more connected graph, requiring fewer hops to reach all points. However, adding noisy edges may potentially decrease the smoothness of the graph.

3.2 COMPARISON WITH PRIOR (SPECTRAL) BOUNDS

We compare our bound with an existing bound that relies on spectral analysis (Belkin & Niyogi, 2004). This bound is for LPA with a soft constraint, given by the problem
$$\min_{f \in \mathbb{R}^{n+m}} \; \sum_{i,j=1}^{n+m} w_{ij}(f_i - f_j)^2 + \eta \sum_{i \le n} (f_i - y_i)^2. \tag{2}$$
We define the empirical error and the generalization error as
$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2, \qquad R(f) = \frac{1}{n+m} \sum_{i=1}^{n+m} (f_i - y_i)^2.$$
As we do not have a hard constraint ($f_i = y_i$ for $i \le n$), the empirical error is not necessarily zero.

Theorem 2. (Generalization performance of graph regularization, simplified version) Let $f^*$ be the optimal solution of Equation 2, let $n \ge 4$ be the number of randomly sampled labeled points from some graph $G$, and let $\lambda_1$ be the second smallest eigenvalue of the Laplacian matrix of $G$. With probability $1 - \delta$, we have
$$|R_n(f^*) - R(f^*)| \le \beta + O\Big(\sqrt{\tfrac{\ln(1/\delta)}{n}}\,(n\beta + 1)\Big), \qquad \beta = \frac{3\eta^2}{n(\lambda_1 - \eta)^2} + \frac{4\eta}{\lambda_1 - \eta}.$$

The original version of this bound, as in Belkin & Niyogi (2004), is in Appendix C. We consider the graphs $G_1, G_2, G_3$ in Figure 2 to compare the bounds. Here $v(G)$ refers to the value of a parameter $v$ for a graph $G$.

First, we note that this result bounds the difference between the empirical error and the generalization error, while our bound is for the generalization error itself. The spectral bound assumes that we have randomly sampled initial labeled points, and thus the bound only depends on the number of labeled points. For example, $G_2$ and $G_3$ have the same underlying graph and the same number of labeled points, so they have the same spectral bound on the generalization error, which relies on the empirical error. In contrast, our bound takes the position of the labeled points into account and provides an explicit explanation of why LPA performs better on $G_2$ than on $G_3$. We can see this since $G_2$ is smoother than $G_3$ ($c_2(G_2) = 0$, while $c_2(G_3) = s_2(G_3)/C_{\mathrm{in}}(2)(G_3) > 0$).

The spectral bound depends on the second smallest eigenvalue $\lambda_1$.
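For reference, $\lambda_1$ can be computed directly from the graph. A minimal sketch, assuming the unnormalized Laplacian $L = D - W$ of a dense symmetric adjacency matrix (the choice of Laplacian here is ours for illustration; the paper does not specify a normalization):

```python
import numpy as np

def second_smallest_laplacian_eigenvalue(W):
    """lambda_1 in Theorem 2: second smallest eigenvalue of L = D - W."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals = np.linalg.eigvalsh(L)   # ascending order for symmetric matrices
    return eigvals[1]
```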
If the graph is not well clustered, $\lambda_1$ will be small. For example, $\lambda_1(G_1) = 2$ and $\lambda_1(G_2) = 0.53$. Belkin & Niyogi (2004) suggest that when $\lambda_1$ is small, we should cut the graph in two, using the eigenvector corresponding to $\lambda_1$, and optimize the objective separately. Our bound works for any graph; in fact, our bound is also tight on $G_2$, where LPA achieves zero error (as $c_1(G_2) = c_2(G_2) = 0$). Also, as $\eta \to \infty$, the objective of Equation 2 is equivalent to Equation 1. Then, the spectral bound takes on the value $\beta \to \frac{3}{n} - 4$, which implies that it no longer depends on the geometry of the graph ($\lambda_1$). Finally, the spectral bound does not use any prior information, while our bound captures the interplay between the quality of the graph and the quality of the prior information.

4 LABEL PROPAGATION WITH MULTIPLE SOURCES OF INFORMATION

We now consider the setting where we observe multiple sources of prior information and provide a framework to incorporate them into LPA. Assume that we have multiple initial noisy predictions $h_i : \mathcal{X} \to [0, 1]$ for $i = 1, 2, \ldots, k$. A natural extension of the LPA objective is given by
$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2} \sum_{i,j=1}^{n+m} w_{ij}(f_i - f_j)^2 + \sum_{i=1}^{n+m} \sum_{j=1}^{k} (f_i - h_j(x_i))^2 \, \alpha_j(x_i) \tag{3}$$
such that $f_i = y_i$ for $i \le n$. The first term encourages our prediction to be smooth with respect to the graph, while the second term encourages our prediction to also be close to the initial predictions. The function $\alpha_j : \mathcal{X} \to [0, \infty)$, which we need to learn, controls how close we want our final prediction to be to each initial prediction $h_j$. We can turn this into a standard label propagation problem, for which we have an efficient iterative method, by augmenting the graph $G$ with dongle nodes (Appendix D).

Figure 3: One can turn label propagation with multiple sources of information into a standard label propagation problem by augmenting the graph $G$ (left) with dongle nodes (right). The colored point represents a labeled point; the unshaded points are dongle nodes.

Given fixed $\alpha_j$ for each $j = 1, \ldots, k$, we can also show that there exists an initial prediction $\bar{h}$ such that the solution of Equation 3 is equivalent to a solution of LPA with the single initial prediction $\bar{h}$, to which our analysis applies (Appendix E). A key question for this framework is how to choose $\alpha_j$. In the ideal setting, we set $\alpha_j(x_i) = 0$ when $h_j$ makes an incorrect prediction at point $x_i$ and $\alpha_j(x_i) = 1$ when $h_j$ makes a correct prediction, i.e., $\alpha_j(x_i) = \mathbb{1}[\mathbb{1}[h_j(x_i) > 0.5] = y_i]$. However, this is not applicable in a practical setting, as knowing whether $h_j$ is correct at a point $x_i$ is equivalent to knowing the corresponding true label $y_i$ of that point. We now investigate different approaches to select the function $\alpha_j$.

4.1 ESTIMATED ACCURACY

Although we do not know whether $h_j$ will make a correct prediction at each point $x_i$, we can still approximate its accuracy over the entire dataset. We can use techniques from the crowd-sourcing or weak supervision literature to approximate the accuracy of each noisy labeler. Then, we can set $\alpha_j$ to the estimated accuracy of $h_j$, $\alpha_j(x_i) = P(\mathbb{1}[h_j(x) > 0.5] = y) = p_j$. We also consider setting $\alpha_j = \ln\!\big(\frac{p_j}{1 - p_j}\big)$, as in the boosting literature (Appendix F.1).

4.2 PROBABILISTIC APPROACH

We can also consider a probabilistic approach to select the function $\alpha_j$. Let each $h_j$ be sampled from a Gaussian distribution $h_j \sim \mathcal{N}(y, \sigma_j(x)^2)$ and let $f$ follow the Gaussian field as in Zhu & Ghahramani (2002), where
$$\rho_\beta(f) \propto \exp(-\beta E(f)), \qquad E(f) = \frac{1}{2} \sum_{i,j=1}^{n+m} w_{ij}(f_i - f_j)^2.$$
Then, the log-likelihood is given by
$$\ell(f) = \text{constant} - \sum_{i=1}^{n+m} \sum_{j=1}^{k} \frac{1}{2\sigma_j(x_i)^2} (h_j(x_i) - f_i)^2 - \frac{\beta}{2} \sum_{i,j=1}^{n+m} w_{ij}(f_i - f_j)^2.$$
This resembles the objective of Equation 3 and suggests that we should set our function $\alpha_j$ as $\alpha_j(x_i) = \frac{1}{\sigma_j(x_i)^2}$, where $\sigma_j(x_i)^2$ is the variance of $h_j$ at point $x_i$.
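As a concrete illustration of the two weighting schemes and of the reduction to a single effective prior (Section 4 and Appendix E), the sketch below is our own and assumes estimated accuracies $\hat{p}_j$ (e.g., from Snorkel) for the accuracy-based weights and per-point variance estimates $\hat{\sigma}_j(x_i)$ for the probabilistic weights; abstentions are encoded as NaN and receive zero weight.

```python
import numpy as np

def accuracy_alpha(p_hat, votes):
    """Section 4.1: constant weight alpha_j = p_hat_j wherever h_j votes, 0 on abstentions.
    votes has shape (n_points, k) with entries in {0, 1, NaN}."""
    return np.where(np.isnan(votes), 0.0, p_hat[None, :])

def probabilistic_alpha(sigma_hat, votes):
    """Section 4.2: alpha_j(x_i) = 1 / sigma_j(x_i)^2, 0 on abstentions."""
    return np.where(np.isnan(votes), 0.0, 1.0 / (sigma_hat ** 2 + 1e-12))

def reduce_to_single_prior(votes, alpha):
    """Appendix E: multiple priors h_j with weights alpha_j(x_i) are equivalent to a single
    prior h_bar(x_i) = sum_j alpha_j(x_i) h_j(x_i) / sum_j alpha_j(x_i)
    with per-point regularization strength mu(x_i) = sum_j alpha_j(x_i)."""
    v = np.nan_to_num(votes, nan=0.0)
    mu = alpha.sum(axis=1)                           # mu(x_i)
    h_bar = np.divide((alpha * v).sum(axis=1), mu,
                      out=np.full(len(mu), 0.5),     # no votes -> uninformative prior 0.5
                      where=mu > 0)
    return h_bar, mu
```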
With access to a small set of labeled data points, we can estimate $\sigma_j(x_i)$ through heteroscedastic regression (Wasserman, 2006), which is further discussed in Appendix F.2. We note that this function $\alpha_j$ changes over values of $x$, as it is computed through regression, while the accuracy-based weighting uses a constant value for $\alpha_j$.

5 EXPERIMENTS

We connect LPA and the field of weak supervision by using weak labelers as our source of prior information. Formally, a set of weak labelers is given by $\lambda = \{\lambda_1, \ldots, \lambda_k\}$, where each $\lambda_i : \mathcal{X} \to \{0, 1, \emptyset\}$ and $\emptyset$ denotes an abstention. Here, we consider LPA with a single source of prior information $h(x) = h_\lambda(x)$, an aggregation of the weak labelers, which we refer to as LPA+WL. For this method, we use Snorkel MeTaL (Ratner et al., 2019) as our aggregation scheme. We also consider our extensions of LPA with multiple sources of prior information, where we set $h_i = \lambda_i$ for each weak labeler. We refer to our extensions of LPA that incorporate weak labelers through dongle nodes as LPAD (A) and LPAD (P), where the last letter denotes the technique used to estimate the weights of the dongle edges (accuracy and probabilistic approach, respectively). For methods that require accuracies, we use accuracies estimated via Snorkel MeTaL. We note that LPAD (A) and LPA+WL both use the Snorkel-estimated accuracies; we provide a discussion of their difference in Appendix E.1. For LPA+WL, we set $\mu = 1$. Further experimental details for our methods and the baselines are in Appendix G.

We compare our approaches to existing weak supervision methods, standard LPA, and other semi-supervised baselines on 4 binary classification datasets from the WRENCH benchmark (Zhang et al., 2021). The features for these text and image datasets are extracted from BERT (Kenton & Toutanova, 2019) and ResNet (He et al., 2016b), respectively. On each dataset, we balance the training data to have equal class proportions. To generate a small set of labeled data, we randomly sample n = 100 points from the training data. The remaining data serves as our unlabeled training data. For all graph-based methods, we construct a graph G with average degree t, which is a hyperparameter of our method, and with edges of weight 1. More information about t and other hyperparameters of all approaches is in Appendix G.2. Code to replicate our experiments can be found at https://github.com/dsam99/label_propagation_weak_supervision.

| Method | YouTube | SMS | Basketball | CDR |
|---|---|---|---|---|
| LPA | 82.00 ± 1.37 | 94.32 ± 0.45 | 78.71 ± 2.41 | 67.41 ± 0.82 |
| GCN | 84.16 ± 0.95 | 94.32 ± 1.02 | 61.34 ± 1.16 | 65.42 ± 1.00 |
| Snorkel + L | 87.44 ± 0.47 | 96.24 ± 0.32 | 82.08 ± 0.83 | 68.03 ± 0.28 |
| FS + L | 87.76 ± 0.51 | 94.84 ± 0.43 | 70.23 ± 1.20 | 67.70 ± 0.29 |
| CLL | 88.56 ± 0.80 | 94.56 ± 0.73 | 77.02 ± 3.96 | 68.52 ± 0.58 |
| Liger + L | 88.72 ± 0.58 | 96.08 ± 0.38 | 80.98 ± 1.71 | 67.33 ± 0.18 |
| LPA + WL | 88.32 ± 0.50 | 96.80 ± 0.36 | 83.13 ± 1.43 | 67.61 ± 0.19 |
| LPAD (A) | 90.32 ± 0.43 | 96.32 ± 0.52 | 83.06 ± 0.74 | 68.13 ± 0.74 |
| LPAD (P) | 87.84 ± 0.53 | 96.64 ± 0.39 | 82.01 ± 2.96 | 68.97 ± 0.51 |

Table 1: We report accuracy (± s.e.) on a held-out test dataset for training an endmodel on pseudolabeled training data, averaged over 5 seeds. We highlight the best performing method in red and the second best performing method in blue.

5.1 BASELINES

We compare our methods against various semi-supervised and existing weakly supervised learning approaches. We use ground truth labels instead of pseudolabels on the 100 labeled points in methods denoted with (+L).
In all these approaches, we train an inductive endmodel on the pseudolabeled training data, as is standard in the WSL literature.

Label Propagation (LPA): The standard label propagation baseline (Zhu & Ghahramani, 2002) on graph G. This does not take weak labeler information into account.

Graph Convolutional Network (GCN): We provide results for a standard graph convolutional network (Kipf & Welling, 2017). This method also does not take weak labeler information into account.

Snorkel + L: A weakly supervised learning aggregation scheme, Snorkel MeTaL (Ratner et al., 2017; 2019), which produces pseudolabels through a graphical model.

FlyingSquid + L (FS + L): Another weakly supervised method that estimates the parameters of a graphical model via a triplet method (Fu et al., 2020).

Constrained Label Learning (CLL): A method that produces a labeling contained within a feasible space constrained by the error rates of the weak labelers (Arachie & Huang, 2021). We use the small set of labeled data to estimate the error rates of the weak labelers.

Liger + L: A method that extends weak labelers using the smoothness of pretrained models and develops cluster-level aggregations (Chen et al., 2022). This method uses FS (Fu et al., 2020) as its base aggregation scheme.

5.2 RESULTS

We provide results for the test accuracy of training an endmodel on pseudolabels generated by the baselines and our methods in Table 1. Our results demonstrate that incorporating weak labels into label propagation improves upon the performance of LPA across all datasets (LPA + WL > LPA). In addition, on almost every dataset, using LPA to incorporate smoothness improves upon the prior aggregation of weak labels (LPA + WL > Snorkel + L). Our methods also outperform other weakly supervised aggregation methods in most cases. The strongest baselines are Liger and CLL, each of which only marginally outperforms some of our methods on a single dataset. We note that there is no clear best weighting scheme between (A) and (P), although both outperform most baselines on almost all tasks. In addition, on every dataset, one of our methods is the best performing approach.

We also report the accuracy of standard LPA and our methods on the labeled and unlabeled training data in Table 2 (i.e., evaluating pseudolabel accuracies). We measure the accuracy of an approach as 50% (i.e., random guessing on binary data) on all abstained points, as the method has no information about these points.

| Method | YouTube Acc. | YouTube Cov. | SMS Acc. | SMS Cov. | CDR Acc. | CDR Cov. |
|---|---|---|---|---|---|---|
| Snorkel + L | 75.96 ± 0.13 | 89.75 ± 0.08 | 70.40 ± 0.34 | 48.31 ± 0.52 | 70.64 ± 0.12 | 92.41 ± 0.11 |
| LPA | 55.98 ± 0.08 | 11.97 ± 0.16 | 54.71 ± 0.01 | 9.42 ± 0.03 | 50.79 ± 0.00 | 1.58 ± 0.00 |
| Liger + L | 81.06 ± 0.47 | 99.98 ± 0.01 | 78.62 ± 0.23 | 96.01 ± 0.20 | 50.56 ± 0.13 | 81.79 ± 10.72 |
| LPA + WL | 76.02 ± 0.12 | 89.81 ± 0.08 | 70.75 ± 0.35 | 49.03 ± 0.52 | 70.64 ± 0.12 | 92.41 ± 0.11 |
| LPAD (A) | 84.03 ± 0.15 | 89.81 ± 0.08 | 70.80 ± 0.26 | 49.00 ± 0.51 | 72.87 ± 0.08 | 91.66 ± 0.49 |
| LPAD (P) | 89.52 ± 0.13 | 89.75 ± 0.08 | 70.98 ± 0.27 | 49.03 ± 0.52 | 71.91 ± 0.25 | 92.22 ± 0.11 |

Table 2: We report accuracy and coverage (± s.e.) of the various label propagation methods on the full partially labeled training data (i.e., pseudolabel accuracies), averaged over 5 seeds. We also add Liger + L, as it looks to improve coverage by extending weak labelers. We bold the best performing method (in terms of accuracy) on each dataset.
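For clarity, this evaluation convention (counting abstained points as 50% accuracy) can be written as a short helper; this is our own sketch of the protocol, assuming soft scores in [0, 1] with abstention defined as a score within ε of 0.5, as in Appendix H.

```python
import numpy as np

def pseudolabel_metrics(scores, y_true, eps=1e-3):
    """Accuracy and coverage of soft pseudolabels; abstained points count as 0.5 accuracy."""
    scores = np.asarray(scores, dtype=float)
    abstain = np.abs(scores - 0.5) <= eps     # no usable vote at these points
    preds = (scores > 0.5).astype(int)
    correct = (preds == np.asarray(y_true)).astype(float)
    correct[abstain] = 0.5                    # random-guess credit on abstentions
    return correct.mean(), 1.0 - abstain.mean()
```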
Results for additional datasets are deferred to the Appendix (Table 3); on these datasets, the weak labeler coverage is almost 100% of the data, so coverage is roughly the same across our methods and the baselines. We observe that coverage drastically increases when using our weakly supervised prior (Table 2) over standard LPA, and slightly over that of Snorkel. We also observe that our method improves upon the base aggregation method of Snorkel on almost all datasets, improving both overall accuracy and coverage due to the propagation of information to nearby points. We note that Liger has much higher coverage on YouTube and SMS, although its accuracy on this larger set is much worse (see Table 3 in the Appendix).

6 DISCUSSION

We provide a novel theoretical perspective on LPA that takes advantage of useful prior information. Our bound differs significantly from existing spectral bounds and provides insight into how to best incorporate priors into LPA. We note that our analysis is general and works with any initialization h. We also provide a framework to handle multiple sources of side information and empirical results for the setting of weak supervision. Further work can incorporate other types of prior information into LPA, such as in the recent line of work on learning with past predictions (Mitzenmacher & Vassilvitskii, 2021; Khodak et al., 2022). In addition, our connections of LPA with weakly supervised learning illustrate (both theoretically and empirically) that these methods benefit each other. As a whole, our results support adding smoothness to the standard WSL pipeline and can encourage further connections between semi-supervised learning algorithms and WSL.

We note a few limitations of our method. Our bound depends on several parameters (the smoothness $s_k$ and the prior information accuracy $\alpha_k$); in general, we may need to approximate these values through labeled data, and it remains an open question how to do this effectively. In addition, we make a uniformity assumption that the average error of points with Out connections from $N_k$ is within a constant factor of the average error in $N_k$. Relaxing the bound beyond this assumption is also an open question.

We remark that our approach bridges the gap between classical label propagation and modern deep learning by incorporating information from pretrained models to construct our graph G. Since we construct G through Euclidean distance, our work uses notions of smoothness in the learnt embeddings, which is also noted in Chen et al. (2022). As we gain access to more powerful pretrained models, our approach will also benefit through a better graph G. This method also provides a natural framework to combine information from large pretrained models (via our graph G) and rules provided by domain experts (through our prior predictions $h_1, \ldots, h_k$).

ACKNOWLEDGEMENTS

This work was supported in part by NSF grants IIS-1909816, IIS-1955532, IIS-2211907, and CCF-1910321, by DARPA under cooperative agreement HR00112020003, and by funding from the Bosch Center for Artificial Intelligence and the ARCS Foundation.

REFERENCES

Chidubem Arachie and Bert Huang. Stochastic generalized adversarial label learning. arXiv preprint arXiv:1906.00512, 2019.

Chidubem Arachie and Bert Huang. Constrained labeling for weakly supervised learning. In Uncertainty in Artificial Intelligence, pp. 236-246. PMLR, 2021.

Chidubem Arachie and Bert Huang. Data consistency for weakly supervised learning. arXiv preprint arXiv:2202.03987, 2022.
Abhijeet Awasthi, Sabyasachi Ghosh, Rasna Goyal, and Sunita Sarawagi. Learning from rules generalizing labeled exemplars. In International Conference on Learning Representations, 2020. Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. Advances in neural information processing systems, 27, 2014. Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. Advances in neural information processing systems, 17, 2004. Mikhail Belkin and Partha Niyogi. Semi-supervised learning on riemannian manifolds. Machine Learning, 56:209 239, 2004. Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7 (11), 2006. Lidong Bing, Sneha Chaudhari, Richard C Wang, and William Cohen. Improving distant supervision for information extraction using label propagation through lists. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 524 529, 2015. Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 92 100, 1998. Tianle Cai, Ruiqi Gao, Jason Lee, and Qi Lei. A theory of label propagation for subpopulation shift. In International Conference on Machine Learning, pp. 1170 1182. PMLR, 2021. Mayee F Chen, Daniel Y Fu, Dyah Adila, Michael Zhang, Frederic Sala, Kayvon Fatahalian, and Christopher R e. Shoring up the foundations: Fusing model embeddings and weak supervision. ar Xiv preprint ar Xiv:2203.13270, 2022. Hande Dong, Jiawei Chen, Fuli Feng, Xiangnan He, Shuxian Bi, Zhaolin Ding, and Peng Cui. On the equivalence of decoupled graph convolution network and label propagation. In Proceedings of the Web Conference 2021, pp. 3651 3662, 2021. Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 139, 1997. ISSN 00220000. doi: https://doi.org/10.1006/jcss.1997.1504. URL https://www.sciencedirect. com/science/article/pii/S002200009791504X. Daniel Fu, Mayee Chen, Frederic Sala, Sarah Hooper, Kayvon Fatahalian, and Christopher R e. Fast and three-rious: Speeding up weak supervision with triplet methods. In International Conference on Machine Learning, pp. 3280 3291. PMLR, 2020. Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263 1272. PMLR, 2017. Published as a conference paper at ICLR 2023 Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE international joint conference on neural networks, volume 2, pp. 729 734, 2005. Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017. Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems, 31, 2018. Jeff Z Hao Chen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34, 2021. Kaiming He, X. 
Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016a. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016b. Mikael Henaff, Joan Bruna, and Yann Le Cun. Deep convolutional networks on graph-structured data (2015). ar Xiv preprint ar Xiv:1506.05163, 2015. Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin Benson. Combining label propagation and simple models out-performs graph neural networks. In International Conference on Learning Representations, 2020. Masayuki Karasuyama and Hiroshi Mamitsuka. Manifold-based similarity adaptation for label propagation. Advances in neural information processing systems, 26, 2013. Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171 4186, 2019. Mikhail Khodak, Maria-Florina Balcan, Ameet Talwalkar, and Sergei Vassilvitskii. Learning predictions for algorithms with predictions. Advances in Neural Information Processing Systems, 2022. Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SJU4ay Ygl. Bin Liu, Zhirong Wu, Han Hu, and Stephen Lin. Deep metric transfer for label propagation with limited annotated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0 0, 2019. Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2018. Alessio Mazzetto, Cyrus Cousins, Dylan Sam, Stephen H Bach, and Eli Upfal. Adversarial multi class learning under weak supervision with performance guarantees. In International Conference on Machine Learning, pp. 7534 7543. PMLR, 2021a. Alessio Mazzetto, Dylan Sam, Andrew Park, Eli Upfal, and Stephen H. Bach. Semi-supervised aggregation of dependent weak supervision sources with performance guarantees. In Artificial Intelligence and Statistics (AISTATS), 2021b. Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with predictions. In Tim Roughgarden (ed.), Beyond the worst-case analysis of algorithms, chapter 30. Cambridge University Press, 2021. Published as a conference paper at ICLR 2023 Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher R e. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269 282, 2017. Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, and Christopher R e. Training complex models with multi-task weak supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4763 4771, Jul. 2019. Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher R e. Data programming: Creating large training sets, quickly. Advances in neural information processing systems, 29, 2016. Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. 
Advances in neural information processing systems, 29, 2016. Laine Samuli and Aila Timo. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, volume 4, pp. 6, 2017. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61 80, 2008. Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596 608, 2020. Partha Talukdar and William Cohen. Scaling graph-based semi supervised learning to large number of labels using count-min sketch. In Artificial Intelligence and Statistics, pp. 940 947. PMLR, 2014. Paul Vernaza and Manmohan Chandraker. Learning random-walk label propagation for weaklysupervised semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7158 7166, 2017. Fei Wang and Changshui Zhang. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering, 20(1):55 67, 2007. Hongwei Wang and Jure Leskovec. Unifying graph convolutional neural networks and label propagation. Ar Xiv, abs/2002.06755, 2020. Larry Wasserman. All of nonparametric statistics. Springer Science & Business Media, 2006. Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations, 2020. Yi Xu, Jiandong Ding, Lu Zhang, and Shuigeng Zhou. DP-SSL: Towards robust semi-supervised learning with a few labeled samples. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https:// openreview.net/forum?id=Nl Lyn LBBi01. Yuto Yamaguchi and Kohei Hayashi. When does label propagation fail? a view from a network generative model. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 3224 3230, 2017. Yuto Yamaguchi, Christos Faloutsos, and Hiroyuki Kitagawa. Camlp: Confidence-aware modulated label propagation. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 513 521. SIAM, 2016. Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner. WRENCH: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021. Published as a conference paper at ICLR 2023 Dengyong Zhou, Olivier Bousquet, Thomas Lal, Jason Weston, and Bernhard Sch olkopf. Learning with local and global consistency. Advances in neural information processing systems, 16, 2003. Kuang Zhou, Arnaud Martin, Quan Pan, and Zhunga Liu. Selp: Semi-supervised evidential label propagation algorithm for graph data clustering. International Journal of Approximate Reasoning, 92:139 154, 2018. Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Tech. Rep., Technical Report CMU-CALD-02 107, Carnegie Mellon University, 2002. Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning, pp. 912 919, 2003. 
A CLOSED FORM SOLUTION OF LPA

We provide a closed form solution of the following optimization problem:
$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2}\Big(\sum_{i,j} w_{ij}(f_i - f_j)^2 + \mu \|f - h\|_2^2\Big) \quad \text{s.t. } f_i = y_i \text{ for } i \le n.$$
Here we abuse notation and treat $h$ as the vector $(h(x_1), \ldots, h(x_{n+m}))$ with $h_i = h(x_i)$. For simplicity, we write $i \in L$ for $1 \le i \le n$ and $i \in U$ for $n + 1 \le i \le n + m$. First, note that
$$\frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2 = \frac{1}{2}\Big(\sum_{i \in L}\sum_{j \in L} w_{ij}(f_i - f_j)^2 + 2\sum_{i \in L}\sum_{j \in U} w_{ij}(f_i - f_j)^2 + \sum_{i \in U}\sum_{j \in U} w_{ij}(f_i - f_j)^2\Big).$$
The first term is a constant as $f_i = y_i$ for $i \in L$. Denote by $f_L \in \mathbb{R}^n$ the column vector with entries $f_i$ for $i \in L$ and by $f_U \in \mathbb{R}^m$ the column vector with entries $f_j$ for $j \in U$. We sometimes write $w_{i,j}$ for $w_{ij}$. For the second term, we have
$$\sum_{i \in L}\sum_{j \in U} w_{ij}(f_i - f_j)^2 = \sum_{i \in L}\sum_{j \in U} w_{ij}(f_i^2 - 2 f_i f_j + f_j^2) = \sum_{i \in L}\Big(\sum_{j \in U} w_{ij}\Big) f_i^2 + \sum_{j \in U}\Big(\sum_{i \in L} w_{ij}\Big) f_j^2 - 2\sum_{i \in L}\sum_{j \in U} f_i w_{ij} f_j = \text{constant} + f_U^\top D_{UL} f_U - 2 f_L^\top W_{LU} f_U,$$
where $D_{UL} \in \mathbb{R}^{m \times m}$ is a diagonal matrix with $(D_{UL})_{jj} = \sum_{i \in L} w_{i, j+n}$ and $W_{LU} \in \mathbb{R}^{n \times m}$ is a matrix with entries $(W_{LU})_{ij} = w_{i, j+n}$. For the third term, we have
$$\frac{1}{2}\sum_{i \in U}\sum_{j \in U} w_{ij}(f_i - f_j)^2 = \frac{1}{2}\sum_{i \in U}\sum_{j \in U} w_{ij}(f_i^2 - 2 f_i f_j + f_j^2) = \sum_{j \in U}\Big(\sum_{i \in U} w_{ij}\Big) f_j^2 - \sum_{i \in U}\sum_{j \in U} f_i w_{ij} f_j = f_U^\top D_{UU} f_U - f_U^\top W_{UU} f_U,$$
where $D_{UU} \in \mathbb{R}^{m \times m}$ is a diagonal matrix with $(D_{UU})_{jj} = \sum_{i \in U} w_{i, j+n}$ and $W_{UU} \in \mathbb{R}^{m \times m}$ is a matrix with entries $(W_{UU})_{ij} = w_{i+n, j+n}$. Therefore, the overall objective is given by
$$\min_{f_U \in \mathbb{R}^m} \; \text{constant} + f_U^\top (D_{UL} + D_{UU} - W_{UU}) f_U - 2 f_L^\top W_{LU} f_U + \frac{\mu}{2}\|f - h\|_2^2.$$
Differentiating with respect to $f_U$ and setting equal to 0, we have
$$2(D_{UL} + D_{UU} - W_{UU}) f_U - 2 W_{LU}^\top f_L + 2\mu (f_U - h_U) = 0 \;\;\Rightarrow\;\; f_U = (D_{UL} + D_{UU} - W_{UU} + \mu \mathrm{Id})^{-1}(\mu h_U + W_{LU}^\top f_L),$$
where $h_U \in \mathbb{R}^m$ with $(h_U)_j = h_{j+n}$.

We can also extend this to the case when $\mu \in \mathbb{R}^{n+m}$, where we have a different value $\mu_i$ for each $i$. The optimization objective is given by
$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2}\Big(\sum_{i,j} w_{ij}(f_i - f_j)^2 + \sum_i \mu_i (f_i - h_i)^2\Big) \quad \text{s.t. } f_i = y_i \text{ for } i \le n.$$
We can write the regularization term over $U$ as
$$\sum_{i \in U} \mu_i (f_i - h_i)^2 = (f_U - h_U)^\top D_\mu (f_U - h_U),$$
where $D_\mu \in \mathbb{R}^{m \times m}$ is a diagonal matrix with entries $(D_\mu)_{jj} = \mu_{j+n}$. We can write the optimization objective as
$$\min_{f_U \in \mathbb{R}^m} \; \text{constant} + f_U^\top (D_{UL} + D_{UU} - W_{UU}) f_U - 2 f_L^\top W_{LU} f_U + (f_U - h_U)^\top D_\mu (f_U - h_U).$$
Differentiating with respect to $f_U$ and setting equal to 0, we have
$$2(D_{UL} + D_{UU} - W_{UU}) f_U - 2 W_{LU}^\top f_L + 2 D_\mu (f_U - h_U) = 0 \;\;\Rightarrow\;\; f_U = (D_{UL} + D_{UU} + D_\mu - W_{UU})^{-1}(D_\mu h_U + W_{LU}^\top f_L).$$

B THEORETICAL RESULTS

First, we analyze the closed form solution of the LPA in Lemma 1.

Lemma 1. Let $f^*$ be the optimal solution of the optimization problem of Equation 1. Then, for $n + 1 \le i \le n + m$,
$$f^*_i = \frac{\sum_j w_{ij} f^*_j + \mu h_i}{\sum_j w_{ij} + \mu}.$$
Proof. For each $i \le n$, we must have $f^*_i = y_i$ to satisfy the hard constraints. For each $n + 1 \le i \le n + m$, we differentiate the objective with respect to $f_i$ and set it equal to 0, resulting in
$$\sum_j w_{ij}(f_i - f_j) + \mu (f_i - h_i) = 0.$$
Rearranging this and setting $f_j = f^*_j$, we have the lemma.

Next, we will analyze the error $|f^*_i - y_i|$, which is the difference between the optimal solution and the true label. Note that $|f^*_i - y_i| < 0.5$ implies that we have a correct soft label $f^*_i$.

Lemma 2. Let $f^*$ be the optimal solution of the optimization problem of Equation 1. Then, for $n + 1 \le i \le n + m$, we have
$$|f^*_i - y_i| \le \frac{\sum_j w_{ij}|f^*_j - y_j| + \sum_j w_{ij}|y_j - y_i| + \mu |h_i - y_i|}{\sum_j w_{ij} + \mu}.$$
Proof. From Lemma 1,
$$|f^*_i - y_i| = \Big|\frac{\sum_j w_{ij} f^*_j + \mu h_i}{\sum_j w_{ij} + \mu} - y_i\Big| = \Big|\frac{\sum_j w_{ij}(f^*_j - y_i) + \mu (h_i - y_i)}{\sum_j w_{ij} + \mu}\Big| = \Big|\frac{\sum_j w_{ij}(f^*_j - y_j) + \sum_j w_{ij}(y_j - y_i) + \mu(h_i - y_i)}{\sum_j w_{ij} + \mu}\Big| \le \frac{\sum_j w_{ij}|f^*_j - y_j| + \sum_j w_{ij}|y_j - y_i| + \mu|h_i - y_i|}{\sum_j w_{ij} + \mu}.$$
Lemma 2 says that we can bound the error of point $i$, $|f^*_i - y_i|$, by the error of its neighbors $|f^*_j - y_j|$ and terms corresponding to the smoothness of the true labels and the prior information accuracy.
Because we know that the error on labeled points are zero, this lemma motivates us to bound the error in term of the distance of our points from the labeled points. Next, in addition to the average error defined in Definition 5, we define the In-error, Between-error and Out-error for each Nk. Definition 6. For a graph G with an adjacency matrix W = (w)ij, a set of k-hop neighbors Nk and a prediction f Rn+m, we define the In-error, Between-error and Out-error of Nk as Errin(f, y, k) = i Nk,j Nk 1 wij|fi yi| Errbet(f, y, k) = i Nk,j Nk wij|fi yi| Errout(f, y, k) = i Nk,j Nk+1 wij|fi yi| For simplicity, we will write Ein(k), Ebet(k) ,Eout(k) for Errin(f , y, k), Errbet(f , y, k), Errout(f , y, k) respectively. Published as a conference paper at ICLR 2023 We will use Lemma 2, to derive a relationship between errors in Nk and its neighbors. We make use of the fact that the Out-flow of Nk is the same as the In-flow of Nk+1. Lemma 3. For 0 k l 1 Cout(k) = Cin(k + 1) Lemma 4. (Error difference inequality) For 1 k l 1, we have Cin(k)(Ein(k) Eout(k 1)) + X i Nk µ|f i yi| Cout(k)(Ein(k + 1) Eout(k)) + sk + µ|Nk|αk where sk is the smoothness of true labels and αk is the prior information error over Nk. Proof. From lemma 2, we have X j wij|f i yi| + µ|f i yi| X j wij|f j yj| + X j wij|yj yi| + µ|hi yi|. We take a summation over i Nk, X j wij|f i yi| + µ|f i yi| X j wij|f j yj| + X j wij|yj yi| + µ|hi yi|). From the definition of the In-error, Between-error, Out-error, smoothness sk, and weak label error αk, we have LHS = Cin(k)Ein(k) + Cbet(k)Ebet(k) + Cout(k)Eout(k) + X i Nk µ|f i yi| RHS = Cout(k 1)Eout(k 1) + Cbet(k)Ebet(k) + Cin(k + 1)Ein(k + 1) + sk + µ|Nk|αk. From lemma 3, we know that Cout(k) = Cin(k + 1), (4) so we can rearrange the inequality as Cin(k)(Ein(k) Eout(k 1)) + X i Nk µ|f i yi| Cout(k)(Ein(k + 1) Eout(k)) + sk + µ|Nk|αk. Lemma 5. (Error difference inequality k = l) Cin(l)(Ein(l) Eout(l 1)) + X i Nl µ|f i yi| sl + µ|Nl|αl Proof. Similar to lemma 4, we have X j wij|f i yi| + µ|f i yi| X j wij|f j yj| + X j wij|yj yi| + µ|hi yi|) Because, Nl is the last neighborhood, there is no edge out from Nl and LHS = Cin(l)Ein(l) + Cbet(l)Ebet(l) + X i Nl µ|f i yi| RHS = Cout(l 1)Eout(l 1) + Cbet(l)Ebet(l) + sl + µ|Nl|αl and rearrange to Cin(l)(Ein(l) Eout(l 1)) + X i Nl µ|f i yi| sl + µ|Nl|αl. We can see that this inequality contains different notions of error. We now define the proportion between In-error and Out-error. Published as a conference paper at ICLR 2023 Definition 7. Let ak, bk be the proportion of the In-error and Out-error with the average error, ak = Ein(k) Ek , bk = Ein(k) i Nk |f i yi| |Nk| . Assumption 1. (Uniformity of error) We assume that the In-error and Out-error are roughly the same as the average error in each neighborhood. ak = O(1), bk = O(1), bk For example, any graph G that has all points in a neighborhood Nk with the same number of edges that go into and out from that point, has the property that ak = bk = 1. In particular, assume that we have 2 points in Nk, the first point has 4 edges from Nk 1 and 2 edges to Nk+1 while the second point has 2 edges from Nk 1 and 1 edge to Nk+1, this graph still has ak = bk = 1. In general, we expect the proportion bk ak to be close to 1. Next, we will substitute ak, bk in Lemma 4. Corollary 1. For 1 k l 1, we have (ak Ek bk 1Ek 1) Cout(k) Cin(k) + µ|Nk|(ak+1Ek+1 bk Ek) + sk + µ|Nk|αk Cin(k) + µ|Nk| Proof. 
From lemma 4 Cin(k)(Ein(k) Eout(k 1)) + X i Nk µ|f i yi| Cout(k)(Ein(k + 1) Eout(k)) + sk + µ|Nk|αk We let Ein(k) = ak Ek and Eout(k) = bk Ek and P i Nk |f i yi| = |Nk|Ek. Cin(k)(ak Ek bk 1Ek 1) + µ|Nk|Ek Cout(k)(ak+1Ek+1 bk Ek) + sk + µ|Nk|αk Cin(k)(ak Ek bk 1Ek 1) + µ|Nk|(Ek Ek 1) Cout(k)(ak+1Ek+1 bk Ek) + sk + µ|Nk|αk, as we know that Ek 1 0. Then, simplifying yields that (Cin(k) + µ|Nk|)(ak Ek bk 1Ek 1) Cout(k)(ak+1Ek+1 bk Ek)) + sk + µ|Nk|αk (ak Ek bk 1Ek 1) Cout(k) Cin(k) + µ|Nk|(ak+1Ek+1 bk Ek) + sk + µ|Nk|αk Cin(k) + µ|Nk|. Corollary 2. We have (al El bl 1El 1) sl + µ|Nl|αl Cin(l) + µ|Nl|. The corollary implies that the difference between the error between neighborhood can t be too large. We introduce the next two lemma to help deriving the bound. Lemma 6. For d1, d2, . . . , dl that satisfies the following inequalities, dk γkdk+1 + ck for 1 k l 1 and dl cl. Published as a conference paper at ICLR 2023 Proof. The main idea is that we can use the upper bound on dl, dl 1, . . . , dk+1 to find the upper bound of dk. First, we start with dl 1 dl 1 γl 1dl + cl 1 γl 1cl + cl 1. Next, we continue with dl 2, dl 2 γl 2dl 1 + cl 2 γl 2(γl 1cl + cl 1) + cl 2. = clγl 1γl 2 + cl 1γl 2 + cl 2 and so on. With this idea, we can show by induction that We sum these inequalities up to have the lemma. Lemma 7. For x1, x2, . . . , xl that satisfies the following inequalities, akxk bk 1xk 1 dk for 1 k l, when ak, bk, dk are positive constant. We have j=i δj)) + a1 Proof. We divide both side of the inequality by ak, for each 1 k l, we have ak xk 1 + dk We can recursively apply this inequality, ak 1 xk 2 + dk 1 ak 1 ) + dk ak 1 bk 2xk 2 + bk 1 ak 1 dk 1 + dk) ak (δk 1bk 2xk 2 + δk 1dk 1 + dk) ak (δk 1bk 2( bk 3 ak 2 xk 3 + dk 2 ak 2 ) + δk 1dk 1 + dk) ak (δk 1δk 2bk 3xk 3 + δk 1δk 2dk 2 + δk 1dk 1 + dk) j=i δj)) + a1 Now, we are ready to derive the error bound of LPA. Published as a conference paper at ICLR 2023 Theorem 3. Let f be the optimal solution of the optimization problem of Equation 1, the error of f in each neighborhood is given by ck = sk + µ|Nk|αk Cin(k) + µ|Nk|, γk = Cout(k) Cin(k) + µ|Nk|. Under assumption 1, we have Proof. From corollary 1, 2 we have (ak Ek bk 1Ek 1) Cout(k) Cin(k) + µ|Nk|(ak+1Ek+1 bk Ek) + sk + µ|Nk|αk Cin(k) + µ|Nk| (al El bl 1El 1) sl + µ|Nl|αl Cin(l) + µ|Nl|. dk = ak Ek bk 1Ek 1, ck = sk + µ|Nk|αk Cin(k) + µ|Nk|, γk = Cout(k) Cin(k) + µ|Nk| By lemma 6, we have ak Ek bk 1Ek 1 = dk By lemma 7, we have j=i δj)) + a1 The last equality is true because the error E0 = 0. With the assumption 1, bk ak = O(1), we have Published as a conference paper at ICLR 2023 C SPECTRAL BOUND The following is the original version of the spectral generalization bound found in Belkin & Niyogi (2004), where they assume that we can have repeated labeled points (at most u times). Theorem 4. (Generalization performance of graph regularization) Let f be the optimal solution of Equation 2, n 4 be the number of randomly sampled labeled points from some distribution where each vertex occurs no more than u times, together with values y1, . . . , yn, |yi| M. Let λ1 be the second smallest eigenvalue of the Laplacian matrix of G. Assuming that x |f(x)| K, we have with probability 1 δ, (conditional on the multiplicity being no greater than t), |Rn(f) R(f)| β + n nβ + (K + M)2 β = 3η2 un (λ1 ηu)2 + 4ηM λ1 ηu We can set u = 1, M = 1, K = 1 to achieve the simplified version (Theorem 2). 
D DONGLE NODES

We can change a label propagation problem with multiple initial predictions into a standard label propagation problem by augmenting the graph with dongle nodes (Zhu et al., 2003). Without loss of generality, we assume that each initial prediction has 3 possible outputs, $h_j : \mathcal{X} \to \{\emptyset, 0, 1\}$ for $j = 1, \ldots, k$. However, this method also works in the general case when $h_j : \mathcal{X} \to [0, 1]$. We augment the original graph $G$ with the following nodes and edges:

1. For each weak labeler $h_j$, we add 2 nodes to the graph $G$ with vertices $v_{n+m+j}$ and $v_{n+m+k+j}$. These represent a prediction of class 0 or 1 from weak labeler $j$.
2. For each $x_i$ with $h_j(x_i) = 0$, we draw a weighted edge between $v_i$ (the corresponding vertex of $x_i$) and $v_{n+m+j}$ with weight $\alpha_j(x_i)$.
3. For each $x_i$ with $h_j(x_i) = 1$, we draw a weighted edge between $v_i$ (the corresponding vertex of $x_i$) and $v_{n+m+k+j}$ with weight $\alpha_j(x_i)$.
4. For each $x_i$ with $h_j(x_i) = \emptyset$, we do not draw any edge.

Let $G'$ be the new graph with a weighted adjacency matrix $(w'_{ij})$. Then solving the objective of Equation 3 is equivalent to solving
$$\min_{f \in \mathbb{R}^{n+m+2k}} \; \frac{1}{2}\sum_{i,j} w'_{ij}(f_i - f_j)^2 \tag{5}$$
subject to
1. $f_i = y_i$ for $i \le n$,
2. $f_i = 0$ for $n + m + 1 \le i \le n + m + k$,
3. $f_i = 1$ for $n + m + k + 1 \le i \le n + m + 2k$.

We view the initial predictions as dongle nodes and encode the parameter $\alpha_j(x_i)$ as the weight of an edge connecting the corresponding dongle node of predictor $j$ to the node of $x_i$. With a direct calculation, we can see that the objective of Equation 5 is the same as the original objective,
$$\frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2 + \sum_{i=1}^{n+m}\sum_{j=1}^{k} (f_i - h_j(x_i))^2 \alpha_j(x_i),$$
such that $f_i = y_i$ for $i \le n$. With this procedure, we add $2k$ nodes and at most $(n + m)k$ edges to $G$.

E CONNECTION BETWEEN LPA WITH MULTIPLE AND SINGLE INITIAL PREDICTIONS

We analyze the closed form solution of the optimization objective of Equation 3. By differentiating with respect to $f_i$, we know that the optimal solution $f^*_i$ satisfies
$$\sum_{j=1}^{n+m} 2 w_{ij}(f^*_i - f^*_j) + \sum_{j=1}^{k} 2 (f^*_i - h_j(x_i)) \alpha_j(x_i) = 0 \;\;\Rightarrow\;\; f^*_i = \frac{\sum_{j=1}^{n+m} w_{ij} f^*_j + \sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}.$$
From Lemma 1, recall that the optimal solution of LPA with a single initial prediction (the objective of Equation 1) is given by
$$f^*_i = \frac{\sum_j w_{ij} f^*_j + \mu h(x_i)}{\sum_j w_{ij} + \mu}.$$
We can see that if we set
$$\bar{h}(x_i) = \frac{\sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{k} \alpha_j(x_i)}, \qquad \mu(x_i) = \sum_{j=1}^{k} \alpha_j(x_i),$$
then the solution of LPA with multiple initial predictions is equivalent to LPA with the initial prediction $\bar{h}$, which can be seen as a weighted average prediction. The objective is given by
$$\min_{f \in \mathbb{R}^{n+m}} \; \frac{1}{2}\Big(\sum_{i,j} w_{ij}(f_i - f_j)^2 + \sum_{i=1}^{n+m} \mu(x_i)(f_i - \bar{h}(x_i))^2\Big) \quad \text{s.t. } f_i = y_i \text{ for } i \le n.$$
We note that the parameter $\mu$ now depends on each instance $x_i$. When $\sum_{j=1}^{k} \alpha_j(x_i) = \mu$ is a constant for all $x_i$, we have the same setting as in the objective of Equation 1. However, our analysis still works in the case when $\mu(x_i)$ is not a constant.

E.1 DIFFERENCE BETWEEN LPA+WL AND LPAD (A)

We note that LPAD (A) uses Snorkel to estimate the accuracies $\alpha_j$. From above, LPAD (A) is equivalent to LPA with an initial prediction
$$\bar{h}(x_i) = \frac{\sum_{j=1}^{k} \alpha_j h_j(x_i)}{\sum_{j=1}^{k} \alpha_j}, \qquad \mu(x_i) = \sum_{j=1}^{k} \alpha_j(x_i).$$
We observe that $\bar{h}$ is exactly the prior information for LPA+WL. However, the key difference is that for LPA+WL we have a fixed $\mu$ for all data points, while in LPAD (A) the value $\mu(x_i)$ depends on $x_i$. To illustrate this, we consider two scenarios. First, we assume that we have 3 weak labelers $h_1, h_2$, and $h_3$, all with estimated accuracy 0.8. We consider a point $x_1$ with $h_1(x_1) = 1, h_2(x_1) = 1, h_3(x_1) = 1$ and a point $x_2$ with $h_1(x_2) = 1, h_2(x_2) = \emptyset, h_3(x_2) = \emptyset$, where $\emptyset$ is an abstention. We can observe that:

1. $\bar{h}(x_1) = \bar{h}(x_2) = 1$, and
2. $\mu(x_1) = 2.4$ while $\mu(x_2) = 0.8$.

Here, in LPA+WL, $x_1$ and $x_2$ have the same prior information and regularization parameter $\mu$. In LPAD (A), we put much more weight on the regularization parameter $\mu(x_1)$ than on $\mu(x_2)$. This is intuitive, as we should be more confident about our prior information when a higher number of weak labelers agree.

F METHODS FOR SELECTING ALPHA

F.1 BOOSTING APPROACH

From the boosting literature (Freund & Schapire, 1997), given many weak learners $h_j : \mathcal{X} \to \{0, 1\}$ for $j = 1, 2, \ldots, k$, an optimal way to combine these weak learners (corresponding to an exponential loss upper bound) is the weighted average
$$\bar{h} = \frac{\sum_{j=1}^{k} \alpha_j h_j}{\sum_{j=1}^{k} \alpha_j}, \qquad \alpha_j = \ln\Big(\frac{P(h_j(x) = y)}{1 - P(h_j(x) = y)}\Big).$$
Instead of the accuracy, we could set $\alpha_j$ in this fashion suggested by the boosting literature. We show that this value of $\alpha_j$ minimizes the upper bound on the error $|f^*_i - y_i|$. Recall that the optimal solution of Equation 3 satisfies
$$f^*_i = \frac{\sum_{j=1}^{n+m} w_{ij} f^*_j + \sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}.$$
We can bound the error of a point $i$, $|f^*_i - y_i|$, by the error of its neighbors $|f^*_j - y_j|$. Observe that
$$|f^*_i - y_i| = \Big|\frac{\sum_{j=1}^{n+m} w_{ij} f^*_j + \sum_{j=1}^{k} \alpha_j(x_i) h_j(x_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)} - y_i\Big| = \Big|\frac{\sum_{j=1}^{n+m} w_{ij}(f^*_j - y_j + y_j - y_i) + \sum_{j=1}^{k} \alpha_j(x_i)(h_j(x_i) - y_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}\Big| \le \frac{\sum_{j=1}^{n+m} w_{ij}|f^*_j - y_j| + \sum_{j=1}^{n+m} w_{ij}|y_j - y_i| + \big|\sum_{j=1}^{k} \alpha_j(x_i)(h_j(x_i) - y_i)\big|}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}.$$
The first term represents the errors of neighboring points $|f^*_j - y_j|$, the second term represents the smoothness of the true labels on the graph $G$, and the third term represents the accuracy of the weighted prediction. We can improve the upper bound by selecting appropriate values of $\alpha_j(x_i)$ and $w_{ij}$ to minimize the third term,
$$\Big|\frac{\sum_{j=1}^{k} \alpha_j(x_i)(h_j(x_i) - y_i)}{\sum_{j=1}^{n+m} w_{ij} + \sum_{j=1}^{k} \alpha_j(x_i)}\Big|.$$
Consider the following lemma.

Lemma 8. Given $k$ classifiers $h_i : \mathcal{X} \to \{-1, 1\}$ for $i = 1, \ldots, k$, let $\bar{h}(x) = \sum_{i=1}^{k} \alpha_i h_i(x)$ be the weighted average of the classifiers. Assume that the predictions $h_i(x)$ are independent across different $i$. The optimal $\alpha_i$ that minimizes the risk, when the loss is the exponential loss of $\bar{h}(x)$, $L(\bar{h}, x, y) = \exp(-y\bar{h}(x))$, is given by
$$\alpha_i = \frac{1}{2}\ln\Big(\frac{P(h_i(x) = y)}{1 - P(h_i(x) = y)}\Big).$$
Proof. The risk is given by
$$\mathbb{E}(L(\bar{h}, x, y)) = \mathbb{E}\big(\exp(-y\bar{h}(x))\big) = \mathbb{E}\Big(\exp\Big(-y\sum_{i=1}^{k}\alpha_i h_i(x)\Big)\Big) = \prod_{i=1}^{k}\mathbb{E}\big(\exp(-y\alpha_i h_i(x))\big) = \prod_{i=1}^{k}\big(p_i\exp(-\alpha_i) + (1-p_i)\exp(\alpha_i)\big),$$
where $p_i = P(h_i(x) = y)$. It is sufficient to choose $\alpha_i$ to minimize $p_i\exp(-\alpha_i) + (1-p_i)\exp(\alpha_i)$. Differentiating with respect to $\alpha_i$ and setting to zero, we have
$$-p_i\exp(-\alpha_i) + (1-p_i)\exp(\alpha_i) = 0 \;\;\Rightarrow\;\; (1-p_i)\exp(\alpha_i) = p_i\exp(-\alpha_i) \;\;\Rightarrow\;\; \exp(2\alpha_i) = \frac{p_i}{1-p_i} \;\;\Rightarrow\;\; \alpha_i = \frac{1}{2}\ln\Big(\frac{p_i}{1-p_i}\Big).$$
Note that the exponential loss is an upper bound on our hinge loss, and Lemma 8 suggests that to minimize the exponential loss upper bound, we should set $\alpha_i = \frac{1}{2}\ln\big(\frac{p_i}{1-p_i}\big)$.
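A minimal sketch of this boosting-style weighting, assuming estimated accuracies $\hat{p}_j \in (0, 1)$ (clipped away from 0 and 1 so the log-odds stay finite); the helper name is ours.

```python
import numpy as np

def boosting_alpha(p_hat, clip=1e-3):
    """alpha_j = 0.5 * ln(p_j / (1 - p_j)), the exponential-loss-optimal weight (Lemma 8).
    Weak labelers estimated to be worse than chance receive a negative weight."""
    p = np.clip(np.asarray(p_hat, dtype=float), clip, 1.0 - clip)
    return 0.5 * np.log(p / (1.0 - p))
```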
G ADDITIONAL EXPERIMENTAL DETAILS

We use the default splits from the WRENCH benchmark (Zhang et al., 2021) for each of our binary classification datasets. This benchmark has an Apache-2.0 license. For each text classification dataset (YouTube, SMS, CDR), we use pretrained BERT embeddings (Kenton & Toutanova, 2019). For our image classification tasks (Basketball), we use pretrained ResNet embeddings (He et al., 2016a). For each task, we balance the datasets and randomly sample 100 labeled datapoints. We balance the datasets to make sure that the overall sample of labeled data contains roughly the same number of points from each class. We use cluster compute resources to produce our empirical results, running our methods and each of the baselines on a single GPU (NVIDIA GeForce RTX 2080 Ti).

G.1 WEAK LABEL SOURCES

We use the standard weak labels contained within the WRENCH benchmark (Zhang et al., 2021). These are standard in the programmatic weak supervision literature and primarily consist of simple hand-engineered rules. For example, on the YouTube dataset (a spam classification task), examples of weak labels are functions that check for the presence of particular words in a sentence (Figure 4). We refer interested readers to the benchmark (Zhang et al., 2021) and other papers in weak supervision (Ratner et al., 2017) for more details.

Figure 4: Examples of weak labels on the YouTube dataset.

G.2 HYPERPARAMETER OPTIMIZATION

We perform hyperparameter optimization for all methods, selecting the best set of parameters on the validation set. For all methods' endmodels, we optimize over the following parameters:

- learning rate: [0.01, 0.001, 0.0001]
- number of epochs: [20, 30, 40, 50]
- weight decay: [0, 0.01, 0.001]

In each experiment, we use a fixed batch size of 100 and a fixed architecture of a 2-layer neural network with a hidden dimension of 64 and a ReLU activation function.

For all graph-based methods, we have an additional parameter $t \in [1, 2, 5, 10, 100]$, which controls the average degree of nodes in $G$. Let $N$ be the number of nodes in $G$; we use the value $t/N$ as the threshold percentile for our Euclidean distance threshold graph. In essence, we add an edge between two points when the Euclidean distance between them is less than the $(t/N)$-th percentile of all $N^2$ pairwise distances. The motivation is that the $N$ nodes correspond to $N^2$ pairwise distances, so this threshold keeps roughly $tN$ of them, leading to a resulting graph with average degree roughly $t$.

For our GCN baseline, we construct the graph $G$ in the same manner as for all other LPA-based methods. The GCN architecture is a 2-layer neural network with a hidden dimension of 16 and ReLU activations. We then train an endmodel on the pseudolabeled data, with the same architecture as all other methods. For our Liger + L baseline, we optimize over a fixed threshold value $k$ on the cosine similarity for each weak labeler. We note that this baseline is highly sensitive to the value of $k$; for BERT embeddings, we select values of $k \in [0.995, 0.9975, 1]$, as points are much less distinguishable in the embedding space compared to the larger foundation models (GPT-3, CLIP) used in the original paper (Chen et al., 2022). We use 2 clusters for all tasks.
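For concreteness, the sketch below builds the percentile-thresholded Euclidean graph described earlier in this section. The function and variable names are ours, and we interpret the $t/N$ threshold as a fraction of all pairwise distances, which matches the stated average-degree motivation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def build_threshold_graph(X, t):
    """Unweighted Euclidean threshold graph with average degree roughly t.

    X: (N, d) array of embeddings; t: target average degree.
    Keep edge (i, j) when dist(i, j) is below the t/N quantile of all
    pairwise distances, i.e. roughly t*N of the N^2 ordered pairs.
    """
    N = X.shape[0]
    dists = squareform(pdist(X))            # (N, N) pairwise Euclidean distances
    thresh = np.quantile(dists, t / N)      # t/N interpreted as a fraction in [0, 1]
    W = (dists < thresh).astype(float)
    np.fill_diagonal(W, 0.0)                # no self-loops
    return W
```

This assumes $t < N$, which holds for the values of $t$ considered above on these datasets.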
H COMPLETE VERSION OF TABLES

We present our results for both training/pseudolabel performance (Table 3) and test/endmodel performance (Table 4) in more detail and with additional comparisons. In our pseudolabel performance table, we provide the non-abstain accuracy (i.e., accuracy computed only on points on which the model makes a vote). We define abstaining as having a maximum logit (across either class) that is within $\epsilon = 0.001$ of 0.5. We observe a fundamental tradeoff: balancing high accuracy with little coverage against lower accuracy with higher coverage. We also observe that LPA performs well locally on regions connected to labeled points, with non-abstain accuracy close to 100 percent, but has low overall coverage.

For our endmodel results, we additionally compare against a fully supervised approach that uses all of the training data and their labels. We remark that some datasets in the WRENCH benchmark have noisy labels, leading to imperfect fully supervised performance. For example, we also add evaluations on the Tennis dataset, which only achieves 88% fully supervised performance, and all methods seem to match this performance. We also add some additional variations of our dongle-based approach. We add a comparison to a boosting-based choice of $\alpha$ (Appendix F.1), which we refer to as LPAD (B). We also compare against a method that uses the unobserved ground-truth accuracies for $\alpha$ (LPAD (O)) and another method that sets $\alpha = 1$ for all $x$ (LPAD (1)). We note that LPAD (O) is an unfair comparison to all other methods, as it accesses ground-truth accuracies that the other methods do not use; we add this comparison to describe the best potential performance of LPAD.

YouTube
Method        Acc            Cov            NA Acc
Snorkel + L   75.96 ± 0.13   89.75 ± 0.08   78.93 ± 0.14
LPA           55.98 ± 0.08   11.97 ± 0.16   100.00 ± 0.00
Liger + L     81.06 ± 0.47   99.98 ± 0.01   81.07 ± 0.47
LPA + WL      76.02 ± 0.12   89.81 ± 0.08   78.97 ± 0.14
LPAD (A)      84.03 ± 0.15   89.81 ± 0.08   87.89 ± 0.16
LPAD (P)      89.52 ± 0.13   89.75 ± 0.08   94.03 ± 0.11
LPAD (B)      75.87 ± 0.14   89.81 ± 0.08   78.80 ± 0.16
LPAD (O)      89.36 ± 0.08   89.81 ± 0.08   93.8 ± 0.10
LPAD (1)      83.11 ± 0.07   76.97 ± 0.11   93.03 ± 0.07

SMS
Method        Acc            Cov            NA Acc
Snorkel + L   70.40 ± 0.34   48.31 ± 0.52   92.20 ± 0.29
LPA           54.71 ± 0.01   9.42 ± 0.03    100.00 ± 0.00
Liger + L     78.62 ± 0.23   96.01 ± 0.20   79.81 ± 0.18
LPA + WL      70.75 ± 0.35   49.03 ± 0.52   92.32 ± 0.29
LPAD (A)      70.80 ± 0.26   49.00 ± 0.51   92.45 ± 0.10
LPAD (P)      70.98 ± 0.27   49.03 ± 0.52   92.79 ± 0.13
LPAD (B)      70.83 ± 0.24   48.93 ± 0.05   92.58 ± 0.16
LPAD (O)      71.52 ± 0.30   49.00 ± 0.51   93.91 ± 0.19
LPAD (1)      70.88 ± 0.26   48.64 ± 0.51   92.93 ± 0.09

CDR
Method        Acc            Cov            NA Acc
Snorkel + L   70.64 ± 0.12   92.41 ± 0.11   72.33 ± 0.15
LPA           50.79 ± 0.00   1.58 ± 0.00    100.00 ± 0.00
Liger + L     50.56 ± 0.13   81.79 ± 10.72  50.98 ± 0.46
LPA + WL      70.64 ± 0.12   92.41 ± 0.11   72.33 ± 0.15
LPAD (A)      72.87 ± 0.08   91.66 ± 0.49   74.95 ± 0.17
LPAD (P)      71.91 ± 0.25   92.22 ± 0.11   73.75 ± 0.25
LPAD (B)      70.40 ± 0.08   91.64 ± 0.50   72.27 ± 0.13
LPAD (O)      73.91 ± 0.07   92.23 ± 0.09   75.93 ± 0.09
LPAD (1)      71.39 ± 0.07   75.00 ± 0.11   78.52 ± 0.09

Basketball
Method        Acc            Cov            NA Acc
Snorkel + L   69.09 ± 0.02   100.00 ± 0.0   69.09 ± 0.02
LPA           62.16 ± 0.02   25.82 ± 0.05   97.12 ± 0.17
Liger + L     59.06 ± 2.53   55.83 ± 9.88   65.21 ± 1.17
LPA + WL      74.09 ± 0.30   99.94 ± 0.04   74.11 ± 0.30
LPAD (A)      69.75 ± 0.33   99.95 ± 0.03   69.76 ± 0.33
LPAD (P)      82.46 ± 0.19   90.42 ± 0.32   85.89 ± 0.19
LPAD (B)      74.26 ± 0.30   99.95 ± 0.02   74.28 ± 0.30
LPAD (O)      74.02 ± 0.33   99.99 ± 0.01   74.03 ± 0.33
LPAD (1)      75.87 ± 0.20   69.97 ± 0.23   86.97 ± 0.24

Tennis
Method        Acc            Cov            NA Acc
Snorkel + L   86.10 ± 0.06   100.00 ± 0.0   86.10 ± 0.06
LPA           70.60 ± 0.29   52.61 ± 0.34   89.16 ± 0.44
Liger + L     83.60 ± 1.34   100.00 ± 0.00  83.60 ± 1.34
LPA + WL      86.91 ± 0.14   99.87 ± 0.03   86.96 ± 0.15
LPAD (A)      87.15 ± 0.05   99.94 ± 0.03   87.17 ± 0.06
LPAD (P)      87.69 ± 0.11   99.98 ± 0.01   87.70 ± 0.12
LPAD (B)      87.20 ± 0.08   99.97 ± 0.01   87.21 ± 0.08
LPAD (O)      87.23 ± 0.09   100.00 ± 0.00  87.23 ± 0.09
LPAD (1)      87.46 ± 0.17   99.38 ± 0.04   87.70 ± 0.17

Table 3: We report accuracy, coverage, and non-abstaining accuracy (NA Acc) of the baselines and our variants of LPA on the training data (i.e., pseudolabel statistics), averaged over 5 seeds.
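As a reading aid for Table 3, the sketch below shows one way to compute accuracy, coverage, and non-abstain accuracy from soft pseudolabels in $[0, 1]$, using the $\epsilon = 0.001$ abstention band around 0.5 defined above. How abstaining points are counted in the overall accuracy is our own convention and may differ from the paper's scripts.

```python
import numpy as np

def pseudolabel_metrics(f, y, eps=1e-3):
    """Accuracy, coverage, and non-abstain accuracy for soft pseudolabels.

    f: (n,) propagated scores in [0, 1];  y: (n,) labels in {0, 1}.
    A point abstains when its score is within eps of 0.5.
    """
    preds = (f > 0.5).astype(int)
    votes = np.abs(f - 0.5) > eps                 # non-abstaining points
    acc = float((preds == y).mean())              # abstaining points scored as hard predictions
    cov = float(votes.mean())
    na_acc = float((preds[votes] == y[votes]).mean()) if votes.any() else float("nan")
    return acc, cov, na_acc
```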
Method            YouTube        SMS            Basketball     CDR            Tennis
Snorkel + L       87.44 ± 0.47   96.24 ± 0.32   82.08 ± 0.83   68.03 ± 0.28   88.51 ± 0.04
FS + L            87.76 ± 0.51   94.84 ± 0.43   70.23 ± 1.20   67.70 ± 0.29   88.56 ± 0.02
CLL               88.56 ± 0.80   94.56 ± 0.73   77.02 ± 3.96   68.52 ± 0.58   88.78 ± 0.13
LPA               82.00 ± 1.37   94.32 ± 0.45   78.71 ± 2.41   67.41 ± 0.82   83.35 ± 3.30
GCN               84.16 ± 0.95   94.32 ± 1.02   61.34 ± 1.16   65.42 ± 1.00   88.63 ± 0.14
Liger + L         88.72 ± 0.58   96.08 ± 0.38   80.98 ± 1.71   67.33 ± 0.18   86.43 ± 0.87
LPA + WL          88.32 ± 0.50   96.80 ± 0.36   83.13 ± 1.43   67.61 ± 0.19   88.51 ± 0.04
LPAD (A)          90.32 ± 0.43   96.32 ± 0.52   83.06 ± 0.74   68.13 ± 0.74   88.56 ± 0.02
LPAD (P)          87.84 ± 0.53   96.64 ± 0.39   82.01 ± 2.96   68.97 ± 0.51   88.60 ± 0.04
LPAD (B)          88.64 ± 0.37   96.56 ± 0.33   76.58 ± 2.20   67.01 ± 0.43   88.56 ± 0.02
LPAD (O)          90.16 ± 0.50   96.40 ± 0.50   81.10 ± 1.43   69.06 ± 0.59   88.58 ± 0.02
LPAD (1)          83.20 ± 1.15   94.32 ± 0.35   78.61 ± 1.95   67.76 ± 0.21   88.43 ± 0.16
Fully Supervised  89.92 ± 1.45   98.04 ± 0.38   86.04 ± 2.02   73.71 ± 0.85   88.43 ± 1.06

Table 4: We report accuracy on test data for training an endmodel on pseudolabeled training data, averaged over 5 seeds. We bold baselines when they outperform both of our methods, and we bold our methods when they outperform all baselines.

I HYPERPARAMETER ABLATION

We provide an ablation study on the hyperparameter $t$. We report accuracy on the test data of an endmodel trained on pseudolabels from the baselines and our methods, for particular values of $t$ used to construct $G$. We observe that all graph-based methods are sensitive to the choice of $t$, which controls the sparsity of edges in the (Euclidean) graph $G$. We remark that this finding is intuitive, as most graph-based semi-supervised algorithms leverage properties of this graph to achieve better performance. We can see a common trend among all methods: when $t$ is large, the endmodel accuracy tends to decrease. We note that Snorkel + L does not leverage any graph information, but we still include it here for comparison.

Method        t = 1          t = 2          t = 5          t = 10         t = 100
Snorkel + L   87.04 ± 0.47   85.44 ± 0.9    86.4 ± 0.78    86.4 ± 0.78    87.04 ± 0.47
LPA           79.76 ± 1.99   81.6 ± 1.15    82.0 ± 1.37    82.64 ± 1.62   75.68 ± 1.51
LPA + WL      86.24 ± 0.75   86.56 ± 0.81   86.16 ± 0.71   88.32 ± 0.5    83.12 ± 1.2
LPAD (A)      89.12 ± 0.5    88.96 ± 1.04   89.04 ± 0.68   90.32 ± 0.43   84.0 ± 0.54
LPAD (B)      87.2 ± 0.55    86.24 ± 1.22   87.12 ± 0.69   88.64 ± 0.37   83.84 ± 0.27
LPAD (P)      87.76 ± 0.79   87.84 ± 0.53   89.2 ± 0.31    89.92 ± 0.53   82.8 ± 1.03

Table 5: We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods, for different values of the hyperparameter t, averaged over 5 seeds.

Method        t = 1          t = 2          t = 5          t = 10         t = 100
Snorkel + L   95.04 ± 0.46   96.08 ± 0.48   96.24 ± 0.32   96.24 ± 0.32   95.04 ± 0.46
LPA           94.32 ± 0.45   94.52 ± 0.26   95.24 ± 0.38   92.2 ± 1.08    80.76 ± 3.51
LPA + WL      94.88 ± 0.67   96.44 ± 0.26   96.8 ± 0.36    95.36 ± 0.6    85.12 ± 3.82
LPAD (A)      95.6 ± 0.28    96.8 ± 0.33    96.32 ± 0.52   95.96 ± 0.41   86.12 ± 4.27
LPAD (B)      96.04 ± 0.19   96.68 ± 0.32   96.56 ± 0.33   96.12 ± 0.34   82.64 ± 4.36
LPAD (P)      95.8 ± 0.46    95.64 ± 0.36   96.64 ± 0.39   96.16 ± 0.48   84.32 ± 2.82

Table 6: We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods, for different values of the hyperparameter t, averaged over 5 seeds.
Method        t = 1          t = 2          t = 5          t = 10         t = 100
Snorkel + L   67.87 ± 0.25   68.03 ± 0.28   68.24 ± 0.72   68.24 ± 0.72   67.87 ± 0.25
LPA           67.01 ± 0.82   65.17 ± 1.09   63.19 ± 1.18   59.33 ± 1.8    44.39 ± 2.69
LPA + WL      67.87 ± 0.25   67.61 ± 0.19   67.16 ± 0.84   65.8 ± 0.82    49.66 ± 1.8
LPAD (A)      68.13 ± 0.74   67.68 ± 1.1    68.65 ± 0.53   67.56 ± 0.38   61.28 ± 3.21
LPAD (B)      66.44 ± 1.23   67.01 ± 0.43   65.98 ± 0.92   66.26 ± 1.15   54.06 ± 2.78
LPAD (P)      68.97 ± 0.51   67.27 ± 1.05   65.48 ± 0.36   64.93 ± 1.8    58.34 ± 3.71

Table 7: We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods, for different values of the hyperparameter t, averaged over 5 seeds.

Method        t = 1          t = 2          t = 5          t = 10         t = 100
Snorkel + L   81.62 ± 1.07   83.62 ± 0.8    82.08 ± 0.83   82.08 ± 0.83   81.62 ± 1.07
LPA           75.56 ± 0.55   79.36 ± 1.49   75.12 ± 1.26   78.71 ± 2.41   66.02 ± 1.19
LPA + WL      83.13 ± 1.43   80.87 ± 0.64   79.95 ± 1.56   80.11 ± 1.75   73.45 ± 1.09
LPAD (A)      80.44 ± 0.83   78.9 ± 1.53    81.64 ± 1.35   83.06 ± 0.75   74.83 ± 1.49
LPAD (B)      75.81 ± 3.47   72.75 ± 2.11   76.58 ± 2.2    78.0 ± 4.34    69.36 ± 1.03
LPAD (P)      82.01 ± 2.96   74.6 ± 2.05    73.31 ± 5.25   68.71 ± 4.41   67.18 ± 3.64

Table 8: We report accuracy on test data for training an endmodel on pseudolabeled training data from various label propagation methods, for different values of the hyperparameter t, averaged over 5 seeds.