Lifting Weak Supervision To Structured Prediction

Harit Vishwakarma (hvishwakarma@wisc.edu), Frederic Sala (fredsala@cs.wisc.edu)
Department of Computer Sciences, University of Wisconsin-Madison, WI, USA.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Weak supervision (WS) is a rich set of techniques that produce pseudolabels by aggregating easily obtained but potentially noisy label estimates from a variety of sources. WS is theoretically well understood for binary classification, where simple approaches enable consistent estimation of pseudolabel noise rates. Using this result, it has been shown that downstream models trained on the pseudolabels have generalization guarantees nearly identical to those trained on clean labels. While this is exciting, users often wish to use WS for structured prediction, where the output space consists of more than a binary or multiclass label set: e.g., rankings, graphs, manifolds, and more. Do the favorable theoretical properties of WS for binary classification lift to this setting? We answer this question in the affirmative for a wide range of scenarios. For labels taking values in a finite metric space, we introduce techniques new to weak supervision based on pseudo-Euclidean embeddings and tensor decompositions, providing a nearly-consistent noise rate estimator. For labels in constant-curvature Riemannian manifolds, we introduce new invariants that also yield consistent noise rate estimation. In both cases, when using the resulting pseudolabels in concert with a flexible downstream model, we obtain generalization guarantees nearly identical to those for models trained on clean data. Several of our results, which can be viewed as robustness guarantees in structured prediction with noisy labels, may be of independent interest. Empirical evaluation validates our claims and shows the merits of the proposed method. Code is available at https://github.com/SprocketLab/WS-Struct-Pred.

1 Introduction

Weak supervision (WS) is an array of methods used to construct pseudolabels for training supervised models in label-constrained settings. The standard workflow [RSW+16, RBE+18, FCS+20] is to assemble a set of cheaply-acquired labeling functions (simple heuristics, small programs, pretrained models, knowledge base lookups) that produce multiple noisy estimates of what the true label is for each unlabeled point in a training set. These noisy outputs are modeled and aggregated into a single higher-quality pseudolabel. Any conventional supervised end model can then be trained on these pseudolabels. This pattern has been used to deliver excellent performance in a range of domains in both research and industry settings [DRS+20, RNGS20, SLB20], bypassing the need to invest in large-scale manual labeling. Importantly, these successes are usually found in binary or small-cardinality classification settings.

While exciting, users often wish to use weak supervision in structured prediction (SP) settings, where the output space consists of more than a binary or multiclass label set [BHS+07, KL15]. In such cases, there exists meaningful algebraic or geometric structure to exploit. Structured prediction includes, for example, learning rankings used for recommendation systems [KAG18], regression in metric spaces [PM19], learning on manifolds [RCMR18], graph-based learning [GS19], and more.
An important advantage of WS in the standard setting of binary classification is that it sometimes yields models with nearly the same generalization guarantees as their fully-supervised counterparts. Indeed, the penalty for using pseudolabels instead of clean labels is only a multiplicative constant. This is a highly favorable tradeoff, since acquiring more unlabeled data is easy. This property leads us to ask the key question for this work: does weak supervision for structured prediction preserve generalization guarantees? We answer this question in the affirmative, justifying the application of WS to settings far from its current use.

Generalization results in WS rely on two steps [RHD+19, FCS+20]: (i) showing that the estimator used to learn the model of the labeling functions is consistent, thus recovering the noise rates for these noisy voters, and (ii) using a noise-aware loss to de-bias end-model training [NDRT13]. Lifting these two results to structured prediction is challenging. The only available weak supervision technique suitable for SP is that of [SLV+22]. It suffers from several limitations. First, it relies on the availability of isometric embeddings of metric spaces into R^d but does not explain how to find these. Second, it does not tackle downstream generalization at all.

We resolve these two challenges. We introduce results for a wide variety of structured prediction problems, requiring only that the labels live in some metric space. We consider both finite and continuous (manifold-valued) settings. For finite spaces, we apply two tools that are new to weak supervision. The approach we propose combines isometric pseudo-Euclidean embeddings with tensor decompositions, resulting in a nearly-consistent noise rate estimator. In the continuous case, we introduce a label model suitable for the so-called model spaces, Riemannian manifolds of constant curvature, along with extensions to even more general spaces. In both cases, we show generalization results when using the resulting pseudolabels in concert with a flexible end model from [CRR16, RCMR18].

Contributions:
- New techniques for performing weak supervision in finite metric spaces based on isometric pseudo-Euclidean embeddings and tensor decomposition algorithms,
- Generalizations to manifold-valued regression in constant-curvature manifolds,
- Finite-sample error bounds for noise rate estimation in each scenario,
- Generalization error guarantees for training downstream models on pseudolabels,
- Experiments confirming the theoretical results and showing improvements over [SLV+22].

2 Background and Problem Setup

Our goal is to theoretically characterize how well learning with pseudolabels (built with weak supervision techniques) performs in structured prediction. We seek to understand the interplay between the noise in WS sources and the generalization performance of the downstream structured prediction model. We provide brief background and introduce our problem and some useful notation.

2.1 Structured Prediction

Structured prediction (SP) involves predicting labels in spaces with rich structure. Denote the label space by Y. Conventionally Y is a set, e.g., Y = {-1, +1} for binary classification. In the SP setting, Y has some additional algebraic or geometric structure. In this work we assume that Y is a metric space with metric (distance) d_Y. This covers many types of problems, including:
- Rankings, where Y = S_ρ, the symmetric group on {1, ..., ρ}, i.e., labels are permutations,
- Graphs, where Y = G_ρ, the space of graphs with vertex set V = {1, ..., ρ},
- Riemannian manifolds, including Y = S^d, the sphere, or H^d, the hyperboloid.
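To make the finite metric space setting concrete, the following minimal sketch instantiates the rankings example: labels are permutations of ρ = 4 items and d_Y is a metric on permutations. The choice of Kendall tau distance here is an illustrative assumption (one common choice); the discussion above does not commit to a particular ranking metric.

```python
# A minimal sketch (not the paper's released code) of a finite metric label
# space: rankings of rho = 4 items under the Kendall tau distance.
from itertools import combinations, permutations

def kendall_tau(p, q):
    """Number of item pairs ordered differently by rankings p and q."""
    pos_p = {item: i for i, item in enumerate(p)}
    pos_q = {item: i for i, item in enumerate(q)}
    disagreements = 0
    for a, b in combinations(p, 2):
        if (pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0:
            disagreements += 1
    return disagreements

# The label space Y = S_rho is finite: all rho! permutations.
Y = list(permutations(range(4)))
assert kendall_tau(Y[0], Y[0]) == 0  # d_Y(y, y) = 0
print(len(Y), kendall_tau((0, 1, 2, 3), (3, 2, 1, 0)))  # 24, 6
```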
Learning and Generalization in Structured Prediction. In conventional supervised learning we have a dataset {(x_1, y_1), ..., (x_n, y_n)} of i.i.d. samples drawn from a distribution ρ over X × Y. As usual, we seek to learn a model that generalizes well to points not seen during training. Let F = {f : X → Y} be a family of functions from X to Y. Define the risk R(f) for f ∈ F and the minimizer f* as

$$
R(f) = \int_{\mathcal{X} \times \mathcal{Y}} d_{\mathcal{Y}}^2(f(x), y)\, d\rho(x, y), \qquad f^* \in \operatorname*{arg\,min}_{f \in \mathcal{F}} R(f). \tag{1}
$$

For a large class of settings (including all of those we consider in this paper), [CRR16, RCMR18] have shown that the following estimator is consistent:

$$
\hat{f}(x) = \operatorname*{arg\,min}_{y \in \mathcal{Y}} F(x, y), \qquad F(x, y) := \frac{1}{n} \sum_{i=1}^{n} \alpha_i(x)\, d_{\mathcal{Y}}^2(y, y_i), \tag{2}
$$

where α(x) = (K + νI)^{-1} K_x. Here, K is the kernel matrix for a p.d. kernel k : X × X → R, so that K_{i,j} = k(x_i, x_j), (K_x)_i = k(x, x_i), and ν is a regularization parameter. The procedure is to first compute the weights α and then to perform the optimization in (2) to make a prediction. An exciting contribution of [CRR16, RCMR18] is the generalization bound R(f̂) ≤ R(f*) + O(n^{-1/4}), which holds with high probability as long as there is no label noise. The key question we tackle is: does the use of pseudolabels instead of true labels y_i affect the generalization rate?

2.2 Weak Supervision

In WS, we cannot access any of the ground-truth labels y_i. Instead we observe for each x_i the noisy votes λ_{1,i}, ..., λ_{m,i}. These are m weak supervision outputs provided by labeling functions (LFs) s_a, where s_a : X → Y and λ_{a,i} = s_a(x_i). A two-step process is used to construct pseudolabels. First, we learn a noise model (also called a label model) that determines how reliable each source s_a is. That is, we must learn θ for P_θ(λ_1, λ_2, ..., λ_m | y) without having access to any samples of y. Second, the noise model is used to infer a distribution (or its mode) for each point: P_θ(y_i | λ_{1,i}, ..., λ_{m,i}). We adopt the noise model from [SLV+22], which is suitable for our SP setting:

$$
P_{\theta}(\lambda_1, \ldots, \lambda_m \mid Y = y) = \frac{1}{Z} \exp\Big( -\sum_{a=1}^{m} \theta_a\, d_{\mathcal{Y}}^2(\lambda_a, y) - \sum_{(a,b) \in E} \theta_{a,b}\, d_{\mathcal{Y}}^2(\lambda_a, \lambda_b) \Big). \tag{3}
$$

Z is the normalizing partition function, θ = [θ_1, ..., θ_m]^T > 0 are canonical parameters, and E is a set of correlations. The model can also be described in terms of the mean parameters E[d_Y^2(λ_a, y)]. Intuitively, if θ_a is large, the typical distance from λ_a to y is small and the LF is reliable; if θ_a is small, the LF is unreliable. This model is appropriate for several reasons. It is an exponential family model with useful theoretical properties. It subsumes popular special cases of noise, including, for regression, zero-mean multivariate Gaussian noise; for permutations, a generalization of the popular Mallows model; for the binary case, it produces a close relative of the Ising model.

Our goal is to form estimates θ̂ in order to construct pseudolabels. One way to build such pseudolabels is to compute ỹ = arg min_{z ∈ Y} (1/m) Σ_{a=1}^m θ̂_a d_Y^2(z, λ_a). Observe how the estimated parameters θ̂_a are used to weight the labeling functions, ensuring that more reliable votes receive a larger weight. We are now in a position to state the main research question for this work: do there exist estimation approaches yielding θ̂ that produce pseudolabels ỹ that maintain the same generalization error rate O(n^{-1/4}) when used in (2), or a modified version of (2)?
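As a concrete illustration of this weighting, here is a small sketch (an illustrative assumption, not the paper's released code) that computes ỹ = arg min_z (1/m) Σ_a θ̂_a d_Y^2(z, λ_a) by brute force over a small finite label space of rankings.

```python
# Sketch of weighted pseudolabel inference in a finite metric space: pick the
# label minimizing the weighted sum of squared distances to the LF votes,
# as described in Section 2.2. Illustrative only.
from itertools import permutations

def kendall_tau(p, q):
    return sum((p.index(a) - p.index(b)) * (q.index(a) - q.index(b)) < 0
               for a in p for b in p if a < b)

def pseudolabel(votes, theta_hat, label_space, dist):
    """arg min_z (1/m) * sum_a theta_hat[a] * dist(z, votes[a])^2."""
    m = len(votes)
    return min(label_space,
               key=lambda z: sum(theta_hat[a] * dist(z, votes[a]) ** 2
                                 for a in range(m)) / m)

Y = list(permutations(range(4)))                    # label space: S_4
votes = [(0, 1, 2, 3), (0, 2, 1, 3), (3, 1, 2, 0)]  # three LF outputs
theta_hat = [2.0, 1.5, 0.2]                         # LF 3 estimated least reliable
print(pseudolabel(votes, theta_hat, Y, kendall_tau))
```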
3 Noise Rate Recovery in Finite Metric Spaces

In the next two sections we handle finite metric spaces. Afterwards we tackle continuous (manifold-valued) spaces. We first discuss learning the noise parameters θ, then the use of pseudolabels.

Roadmap. For finite metric spaces with |Y| = r, we apply two tools new to weak supervision. First, we embed Y into a pseudo-Euclidean space [Gol85]. These spaces generalize Euclidean space, enabling isometric (distance-preserving) embeddings for any finite metric. Using pseudo-Euclidean spaces makes our analysis slightly more complex, but we gain the isometry property, which is critical. Second, we form three-way tensors from embeddings of observed labeling function outputs. Applying tensor product decomposition algorithms [AGH+14], we can recover estimates of the mean parameters Ê[d_Y^2(λ_a, y)] and ultimately θ̂_a. Finally, we reweight the model (2) to preserve generalization.

Figure 1: Illustration of our weak supervision pipeline for the finite label space setting (pseudo-Euclidean embedding, tensor decomposition, pseudolabel inference, end model training).

The intuition behind this approach is the following. First, we need a technique that can provide consistent or nearly-consistent estimates of the parameters in the noise model. Second, we need to handle any finite metric space. Techniques like the one introduced in [FCS+20] handle the first but do not work for generic finite metric spaces, only binary labels and certain sequences. Techniques like the one in [SLV+22] handle any metric space but only have consistency guarantees in highly restrictive settings (e.g., it requires an isometric embedding, that the distribution over the resulting embeddings is isomorphic to certain distributions, and that the true label only takes on two values). Pseudo-Euclidean embeddings used with tensor decomposition algorithms meet both requirements.

3.1 Pseudo-Euclidean Embeddings

Our first task is to embed the metric space into a continuous space, enabling easier computation and potential dimensionality reduction. A standard approach is multi-dimensional scaling (MDS) [KW78], which embeds Y into R^d. A downside of MDS is that not all metric spaces embed (isometrically) into Euclidean space, as the Gram matrix derived from the squared distances must be positive semi-definite. A simple and elegant way to overcome this difficulty is to instead use pseudo-Euclidean spaces for embeddings. These pseudo-spaces do not require a p.s.d. inner product. As an outcome, any finite metric space can be embedded into a pseudo-Euclidean space with no distortion [Gol85], so that distances are exactly preserved. Such spaces have been applied to similarity-based learning methods [PPD01, LRBM06, PHD+06].

A vector u in a pseudo-Euclidean space R^{d^+,d^-} has two parts: u^+ ∈ R^{d^+} and u^- ∈ R^{d^-}. The dot product and the squared distance between any two vectors u, v are ⟨u, v⟩_φ = ⟨u^+, v^+⟩ - ⟨u^-, v^-⟩ and d_φ^2(u, v) = ||u^+ - v^+||_2^2 - ||u^- - v^-||_2^2. These properties enable isometric embeddings: the squared distance decomposes into two components that are individually induced from p.s.d. inner products and can thus be embedded via MDS. Indeed, pseudo-Euclidean embeddings effectively run MDS for each component (see Step 1 of Algorithm 1). To recover the original distance, we obtain ||u^+ - v^+||_2^2 and ||u^- - v^-||_2^2 and subtract.
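The construction just described (made explicit in Step 1 of Algorithm 1 below) is essentially classical MDS run on each signed component of the eigenspectrum. The following numpy sketch, a simplified stand-in for the paper's implementation, shows the idea and verifies that squared distances are recovered exactly as a difference of two Euclidean terms.

```python
# A minimal numpy sketch (illustrative, not the released code) of the
# pseudo-Euclidean embedding of Section 3.1: form a Gram-like matrix anchored at
# y_0, split the eigenspace by eigenvalue sign, and scale by |eigenvalue|^(1/2).
import numpy as np

def pseudo_euclidean_embedding(D):
    """D[i, j] = d_Y(y_i, y_j). Returns (Y_plus, Y_minus) with
    d_Y(y_i, y_j)^2 = ||Y_plus[i]-Y_plus[j]||^2 - ||Y_minus[i]-Y_minus[j]||^2."""
    D2 = D ** 2
    # M_ij = (D2_0i + D2_0j - D2_ij) / 2, as in Algorithm 1, Step 1.
    M = 0.5 * (D2[0][None, :] + D2[:, 0][:, None] - D2)
    evals, evecs = np.linalg.eigh(M)
    pos, neg = evals > 1e-9, evals < -1e-9
    Y_plus = evecs[:, pos] * np.sqrt(evals[pos])
    Y_minus = evecs[:, neg] * np.sqrt(-evals[neg])
    return Y_plus, Y_minus

# Sanity check on a small metric space: 4 points on a cycle graph, hop distance.
D = np.array([[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]], float)
Yp, Ym = pseudo_euclidean_embedding(D)
d2 = lambda i, j: np.sum((Yp[i] - Yp[j]) ** 2) - np.sum((Ym[i] - Ym[j]) ** 2)
print(np.allclose([[d2(i, j) for j in range(4)] for i in range(4)], D ** 2))
```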
Example: To see why such embeddings are advantageous, we compare with a one-hot vector representation (whose dimension is |Y|). Consider a tree with a root node and three branches, each of which is a path with t nodes. Let Y be the nodes in the tree, with the shortest-hops distance as the metric. The pseudo-Euclidean embedding dimension is just d = 3 (see the Appendix for more details). The one-hot embedding dimension is d = |Y| = 3t + 1, arbitrarily larger!

Now we are ready to apply these embeddings to our problem. Abusing notation, we write λ_a and y for the pseudo-Euclidean embeddings of λ_a and y, respectively. We have that d_Y^2(λ_a, y) = d_φ^2(λ_a, y), so that there is no loss of information from working with these spaces. In addition, we write the mean as μ_{a,y} = E[λ_a | y] and the covariance as Σ_{a,y}. Our goal is to obtain an accurate estimate μ̂_{a,y} = Ê[λ_a | y], which we will use to estimate the mean parameters E[d_Y^2(λ_a, y)]. If we could observe y, it would be easy to empirically estimate μ_{a,y}, but we do not have access to it. Our approach will be to apply tensor decomposition for multi-view mixtures [AGJ14].

3.2 Multi-View Mixtures and Tensor Decompositions

In a multi-view mixture model, multiple views {λ_a}_{a=1}^m of a latent variable Y are observed. These views are independent when conditioned on Y. We treat the positive and negative components λ_a^+ ∈ R^{d^+} and λ_a^- ∈ R^{d^-} of our pseudo-Euclidean embedding as separate multi-view mixtures:

$$
\lambda_a^+ \mid y \;\sim\; \mu_{a,y}^+ + \sigma \sqrt{d^+}\, \epsilon_a^+ \quad \text{and} \quad \lambda_a^- \mid y \;\sim\; \mu_{a,y}^- + \sigma \sqrt{d^-}\, \epsilon_a^- \qquad \forall a \in [m], \tag{4}
$$

where μ_{a,y}^+ = E[λ_a^+ | y], μ_{a,y}^- = E[λ_a^- | y], and ε_a^+, ε_a^- are mean-zero random vectors with covariances (1/d^+) I_{d^+} and (1/d^-) I_{d^-}, respectively. Here σ^2 is a proxy variance whose use is described in Assumption 3.

We cannot directly estimate these parameters from observations of λ_a, since y is not observed. However, we can observe various moments of the outputs of the LFs, such as tensors of outer products of LF triplets. We require that for each a such a triplet exists. Then,

$$
T^+ := E[\lambda_a^+ \otimes \lambda_b^+ \otimes \lambda_c^+] = \sum_{y \in \mathcal{Y}_s} w_y\, \mu_{a,y}^+ \otimes \mu_{b,y}^+ \otimes \mu_{c,y}^+ \quad \text{and} \quad \hat{T}^+ := \frac{1}{n} \sum_{i=1}^{n} \lambda_{a,i}^+ \otimes \lambda_{b,i}^+ \otimes \lambda_{c,i}^+. \tag{5}
$$

Here w_y are the mixture probabilities (prior probabilities of Y) and Y_s = {y : w_y > 0}. We similarly define T^- and T̂^-. We then obtain estimates μ̂_{a,y}^+, μ̂_{a,y}^- using an algorithm from [AGH+14] with minor modifications to handle pseudo-Euclidean rather than Euclidean space. The overall approach is shown in Algorithm 1.

Algorithm 1: Pseudolabel Construction
Input: Labeling function outputs L = {(λ_{1,i}, ..., λ_{m,i})}_{i=1}^n, label space Y = {y_0, ..., y_{r-1}}
Output: Pseudolabels for each data point Z̃ = {z̃_i}_{i=1}^n

Step 1: Compute pseudo-Euclidean embeddings
1: Construct matrices D ∈ R^{r×r} with D_{ij} = d_Y(y_i, y_j) and M ∈ R^{r×r} with M_{ij} = ½(D_{0i}^2 + D_{0j}^2 - D_{ij}^2)
2: Compute the eigendecomposition M = U C U^T
3: Let l^+, l^- be the indices of the positive and negative eigenvalues, sorted by magnitude
4: Let d^+ = |l^+| and d^- = |l^-|, i.e., the sizes of the lists l^+ and l^-
5: Construct the selection matrix I_perm ∈ R^{r×(d^+ + d^-)} by concatenating l^+ and l^- in order
6: Let C̃ be the diagonal matrix of the selected eigenvalues (ordered by l^+, l^-) and Ũ = U I_perm the corresponding eigenvectors
7: Set Ỹ = Ũ |C̃|^{1/2} ∈ R^{r×(d^+ + d^-)}; let this define the embedding map g : Y → Ỹ

Step 2: Parameter estimation using tensor decomposition
8: for a = 1 to m do
9:   Choose b, c such that λ_a, λ_b, λ_c are uncorrelated, and obtain embeddings λ̃_{a,i} = g(λ_{a,i}), λ̃_{b,i} = g(λ_{b,i}), λ̃_{c,i} = g(λ_{c,i}) for all i ∈ [n]
10:  Construct tensors T̂^+ and T̂^- as defined in (5) for the triplet (a, b, c)
11:  (μ̂_{a,y}^+, μ̂_{b,y}^+, μ̂_{c,y}^+) = TensorDecomposition(T̂^+)
12:  (μ̂_{a,y}^-, μ̂_{b,y}^-, μ̂_{c,y}^-) = TensorDecomposition(T̂^-)
13:  s_{a,y}^+ = arg min_{z ∈ {-1,+1}} φ(z μ̂_{a,y}^+, y^+), and similarly s_{b,y}^+, s_{c,y}^+, s_{a,y}^-, s_{b,y}^-, s_{c,y}^-
14:  μ̂_{a,y}^+ ← s_{a,y}^+ μ̂_{a,y}^+, and similarly correct the signs of μ̂_{b,y}^+, μ̂_{c,y}^+, μ̂_{a,y}^-, μ̂_{b,y}^-, μ̂_{c,y}^-
15: end for

Step 3: Infer pseudolabels
16: Sample z̃_i ∼ P_θ̂(Y | λ_1 = λ_{1,i}, ..., λ_m = λ_{m,i}) for each i ∈ [n]
17: return {z̃_i}_{i=1}^n
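Steps 10-12 of Algorithm 1 form the empirical third-moment tensors of (5) and decompose them into rank-one components. A minimal sketch of that step is below; it uses numpy for the moment tensor and tensorly's CP decomposition as a stand-in for the robust decomposition routine of [AGH+14] (the use of tensorly and the random placeholder data are illustrative assumptions, not the paper's implementation).

```python
# Sketch of Steps 10-12 of Algorithm 1: build the empirical triplet tensor
# T_hat^+ from embedded LF outputs and recover component means via a CP
# decomposition. tensorly's parafac stands in for the method of [AGH+14];
# the paper uses a modified routine for pseudo-Euclidean space.
import numpy as np
from tensorly.decomposition import parafac

def triplet_tensor(La, Lb, Lc):
    """Empirical T_hat = (1/n) * sum_i La[i] (x) Lb[i] (x) Lc[i], eq. (5)."""
    return np.einsum("ip,iq,ir->pqr", La, Lb, Lc) / La.shape[0]

# La, Lb, Lc: (n, d_plus) arrays of positive embedding components of three
# conditionally independent LFs (random placeholders here, for illustration).
rng = np.random.default_rng(0)
La, Lb, Lc = (rng.normal(size=(1000, 3)) for _ in range(3))
T_hat_plus = triplet_tensor(La, Lb, Lc)

k = 2  # number of supported labels, k = |Y_s|
weights, (Ma, Mb, Mc) = parafac(T_hat_plus, rank=k)
# Columns of Ma, Mb, Mc estimate mu_hat^+_{a,y}, mu_hat^+_{b,y}, mu_hat^+_{c,y}
# up to scaling and sign; Steps 13-14 of Algorithm 1 fix the signs.
print(Ma.shape, Mb.shape, Mc.shape)
```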
We have three key assumptions for our analysis.

Assumption 1. The support of P_Y, i.e., k = |{y : w_y > 0}|, and the label space Y are such that min(d^+, d^-) ≥ k, and ||μ_{a,y}^+||_2 = 1, ||μ_{a,y}^-||_2 = 1 for all a ∈ [m], y ∈ Y.

Assumption 2. (Bounded angle between μ and y) Let φ(u, v) denote the angle between any two vectors u, v in a Euclidean space. We assume that φ(μ_{a,y}^+, y^+) ∈ [0, π/2 - c) and φ(μ_{a,y}^-, y^-) ∈ [0, π/2 - c) for all a ∈ [m] and y ∈ Y_s, for some sufficiently small c ∈ (0, π/4] such that sin(c) ≥ max(ε(d^+), ε(d^-)), where ε(d) is defined in (6) for some n > n_0 samples.

Assumption 3. σ is such that the recovery error under model (4) is at least as large as under (3).

These assumptions enable guarantees on recovering the mean vector magnitudes (Assumption 1) and signs (Assumption 2), and they simplify the analysis (Assumptions 1 and 3); all three can be relaxed at the expense of a more complex analysis. Our first theoretical result shows that we have near-consistency in estimating the mean parameters in (3). We use the standard notation Õ to hide logarithmic factors.

Theorem 1. Let μ̂_{a,y}^+, μ̂_{a,y}^- be the estimates of μ_{a,y}^+, μ_{a,y}^- returned by Algorithm 1 with inputs T̂^+, T̂^- constructed using isometric pseudo-Euclidean embeddings (in R^{d^+,d^-}). Suppose Assumptions 1 and 2 are met, a sufficiently large number of samples n is drawn from the model in (3), and k = |Y_s|. Then there exists a constant C_0 > 0 such that, with high probability, for all a ∈ [m] and y ∈ Y_s,

$$
|\theta_a - \hat{\theta}_a| \le C_0 \left| E[d_{\mathcal{Y}}^2(\lambda_a, y)] - \hat{E}[d_{\mathcal{Y}}^2(\lambda_a, y)] \right| \le \epsilon(d^+) + \epsilon(d^-), \quad
\epsilon(d) = \begin{cases} \tilde{O}\big(k\sqrt{d/n} + \sqrt{k}/d\big) & \text{if } \sigma^2 = \Theta(1), \\ \tilde{O}\big(\sqrt{kd/n} + \sqrt{k}/d\big) & \text{if } \sigma^2 = \Theta(1/d). \end{cases} \tag{6}
$$

We interpret Theorem 1. It is a nearly direct application of [AGJ14]. There are two noise cases for σ. In the high-noise case, σ is independent of the dimension d (and thus of |Y|). Intuitively, this means that the average-distance balls around each LF begin to overlap as the number of points grows, explaining the multiplicative k term. If the noise scales down as we add more embedded points, this problem is removed, as in the low-noise case. In both cases, the second error term comes from using the algorithm of [AGH+14] and is independent of the sampling error. Since k = Θ(d), this term goes down with d. The first error term is due to sampling noise and goes to zero in the number of samples n.

Note the tradeoffs of using the embeddings. If we used a one-hot encoding, d = |Y|, and in the high-noise case we would pay a very heavy cost of √(d/n). However, while the sampling error is minimized when using a very small d, we then pay a cost in the second error term. This leads to a tradeoff in selecting the appropriate embedding dimension.

4 Generalization Error for Structured Prediction in Finite Metric Spaces

We have access to labeling function outputs λ_{1,i}, ..., λ_{m,i} for points x_i and noise rate estimates θ̂_1, ..., θ̂_m. How can we use these to infer the unobserved labels y in (2)? Our approach is based on [NDRT13, vRW18], where the underlying loss function is modified to deal with noise. Analogously, we modify (2) in such a way that the generalization guarantee is nearly preserved.

4.1 Prediction with Pseudolabels

First, we construct the posterior distribution P_θ̂(Y = y | λ). We use our estimated noise model P_θ̂(λ | Y) and the prior P(Y = y). We create pseudolabels for each data point by drawing a random sample from the posterior distribution conditioned on the outputs of the labeling functions: z̃_i ∼ P_θ̂(Y | λ_1 = λ_{1,i}, ..., λ_m = λ_{m,i}). We thus observe (x_1, z̃_1), ..., (x_n, z̃_n), where z̃_i is sampled as above.
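Because the pairwise terms in (3) do not involve y, they cancel when normalizing the posterior over y. Assuming additionally that the partition function does not depend on y (which holds for suitably symmetric label spaces; an assumption made here for illustration), the posterior is proportional to the prior times exp(-Σ_a θ̂_a d_Y^2(λ_a, y)). The sketch below (illustrative, not the released implementation) samples pseudolabels this way over a small finite label space.

```python
# Sketch of Step 3 of Algorithm 1 for a finite label space: sample a pseudolabel
# from P(y | lambdas) proportional to P(y) * exp(-sum_a theta_hat_a * d_Y(lambda_a, y)^2).
# Assumes the partition function of (3) does not depend on y; illustrative only.
import numpy as np

def sample_pseudolabel(votes, theta_hat, label_space, dist, prior=None, rng=None):
    rng = rng or np.random.default_rng()
    prior = prior if prior is not None else np.ones(len(label_space))
    log_post = np.log(prior).astype(float)
    for j, y in enumerate(label_space):
        log_post[j] -= sum(t * dist(lam, y) ** 2 for t, lam in zip(theta_hat, votes))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return label_space[rng.choice(len(label_space), p=post)]

# Toy usage: three points on a path graph {0, 1, 2} with |i - j| as the metric.
label_space = [0, 1, 2]
dist = lambda u, v: abs(u - v)
z = sample_pseudolabel(votes=[0, 1, 0], theta_hat=[1.0, 0.5, 1.0],
                       label_space=label_space, dist=dist)
print(z)
```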
To overcome the effect of noise, we create a perturbed version of the distance function using the noise rates, generalizing [NDRT13]. This requires us to characterize the noise distribution induced by our inference procedure. In particular, we seek the probability that Z̃ = y_j when the true label is y_i. This can be expressed as follows. Let Y^m denote the m-fold Cartesian product of Y and let Λ^(u) = (λ_1^(u), ..., λ_m^(u)) denote its u-th entry. We write

$$
P_{ij} = P_{\theta}(\tilde{Z} = y_j \mid Y = y_i) = \sum_{u=1}^{|\mathcal{Y}^m|} P_{\theta}(Y = y_j \mid \Lambda = \Lambda^{(u)})\, P_{\theta}(\Lambda = \Lambda^{(u)} \mid Y = y_i). \tag{7}
$$

We define Q_{ij} = P_θ̂(Z̃ = y_j | Y = y_i) analogously using θ̂. P is the noise distribution induced by the true parameters θ, and Q is an approximation obtained from inference with the estimated parameters θ̂. With this terminology, we can define the perturbed version of the distance function and a corresponding replacement of (2):

$$
\tilde{d}_q(T, y_j) := \sum_{i=1}^{r} (Q^{-1})_{ji}\, d_{\mathcal{Y}}^2(T, y_i) \quad \forall y_j \in \mathcal{Y}, \tag{8}
$$

$$
F_q(x, y) := \frac{1}{n} \sum_{i=1}^{n} \alpha_i(x)\, \tilde{d}_q(y, \tilde{z}_i), \qquad \hat{f}_q(x) = \operatorname*{arg\,min}_{y \in \mathcal{Y}} F_q(x, y). \tag{9}
$$

We similarly define d̃_p, F_p, and f̂_p using the true noise distribution P. The perturbed distance d̃_p is an unbiased estimator of the true distance. However, we do not know the true noise distribution P, hence we cannot use it for prediction. Instead we use d̃_q. Note that d̃_q is no longer an unbiased estimator; its bias can be expressed as a function of the parameter recovery error bound in Theorem 1.

4.2 Bounding the Generalization Error

What can we say about the excess risk R(f̂_q) - R(f*)? Note that compared to prediction based on clean labels, there are two additional sources of error. One is the noise in the labels (i.e., even if we knew the true P, the quality of the pseudolabels is imperfect). The other is our estimation procedure for the noise distribution. We must address both sources of error. Our analysis uses the following assumptions on the minimum and maximum singular values σ_min(P), σ_max(P) and the condition number κ(P) of the true noise matrix P, and on the function F. Additional detail is provided in the Appendix.

Assumption 4. (Noise model is not arbitrary) The true parameters θ are such that σ_min(P) > 0 and the condition number κ(P) is sufficiently small.

Assumption 5. (Normalized features) ||α(x)|| ≤ 1 for all x ∈ X.

Assumption 6. (Proxy strong convexity) The function F in (2) satisfies the following property for some β > 0: as we move away from the minimizer of F, the function increases, and the rate of increase is proportional to the squared distance between the points,

$$
F(x, f(x)) \ge F(x, \hat{f}(x)) + \beta\, d_{\mathcal{Y}}^2(f(x), \hat{f}(x)) \qquad \forall x \in \mathcal{X},\ f \in \mathcal{F}. \tag{10}
$$

With these assumptions, we provide a generalization result for prediction with pseudolabels.

Theorem 2. (Generalization Error) Let f̂ be the minimizer defined in (2) over the clean labels, and let f̂_q (defined in (9)) be the minimizer over the noisy labels obtained from inference in Algorithm 1. Suppose Assumptions 4, 5, and 6 hold, let ε_2 = k^{5/2} Õ(ε(d^+) + ε(d^-)) (1 + κ(P)/σ_min(P)), and suppose k ≤ σ_min(P). Then, with high probability,

$$
R(\hat{f}_q) \le R(f^*) + O(n^{-1/4}) + O(\epsilon_p) + O(\epsilon_2),
$$

where ε_p captures the additional error incurred when learning with knowledge of the true noise distribution P.

Implications and Tradeoffs: We interpret each term in the bound. The first term is present even with access to the clean labels and hence is unavoidable. The second term is the additional error we incur even if we learn with knowledge of the true noise distribution. The third term is due to the use of the estimated noise model; it is dominated by the noise rate recovery result in Theorem 1. If the third term goes to 0 (perfect recovery), then we obtain the rate O(n^{-1/4}), the same as in the case of access to clean labels. The third term is introduced by our noise rate recovery algorithm and itself has two parts: one dominated by O(n^{-1/2}) and the other by O(√k/d) (see the discussion of Theorem 1). Thus we only pay an extra additive factor of O(√k/d) in the excess risk when using pseudolabels.
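To make the correction in (8)-(9) concrete, the following sketch (illustrative only, not the paper's implementation) builds the perturbed distance d̃_q from an estimated noise matrix Q and a squared-distance table over a finite label space, and evaluates F_q for candidate labels.

```python
# Sketch of the noise-corrected objective of Section 4.1 (eqs. (8)-(9)) for a
# finite label space; an illustration, not the released code.
import numpy as np

def corrected_distances(D2, Q):
    """D2[t, i] = d_Y(candidate_t, y_i)^2, Q[i, j] = P(Z~ = y_j | Y = y_i).
    Returns D2_q with D2_q[t, j] = sum_i (Q^{-1})_{ji} * D2[t, i]  (eq. 8)."""
    return D2 @ np.linalg.inv(Q).T

def F_q(alpha, D2_q, z_idx, t):
    """F_q(x, candidate_t) = (1/n) sum_i alpha_i(x) * d_q(candidate_t, z~_i), eq. (9)."""
    return np.mean(alpha * D2_q[t, z_idx])

# Toy example: 3 labels on a path {0, 1, 2}, mildly noisy pseudolabels.
labels = np.array([0, 1, 2])
D2 = (labels[:, None] - labels[None, :]) ** 2.0       # squared metric
Q = np.array([[0.8, 0.15, 0.05], [0.1, 0.8, 0.1], [0.05, 0.15, 0.8]])
D2_q = corrected_distances(D2, Q)
alpha = np.ones(5) / 5                                # placeholder kernel weights
z_idx = np.array([0, 0, 1, 0, 2])                     # observed pseudolabel indices
best = min(range(3), key=lambda t: F_q(alpha, D2_q, z_idx, t))
print(D2_q.round(2), best)
```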
5 Manifold-Valued Label Spaces: Noise Recovery and Generalization

We introduce a simple recovery method for weak supervision in constant-curvature Riemannian manifolds. First we briefly introduce some background notation on these spaces, then provide our estimator and consistency result, then the downstream generalization result. Finally, we discuss extensions to symmetric Riemannian manifolds, an even more general class of spaces.

Background on Riemannian manifolds. The following is necessarily a very abridged background; more detail can be found in [Lee00, Tu11]. A smooth manifold M is a space where each point is located in a neighborhood diffeomorphic to R^d. Attached to each point p ∈ M is a tangent space T_pM; each such tangent space is a d-dimensional vector space enabling the use of calculus. A Riemannian manifold equips a smooth manifold with a Riemannian metric: a smoothly-varying inner product ⟨·,·⟩_p at each point p. This tool allows us to compute angles, lengths, and ultimately distances d_M(p, q) between points on the manifold as shortest-path distances. These shortest paths are called geodesics and can be parametrized as curves γ(t), where γ(0) = p, or by tangent vectors v ∈ T_pM. The exponential map exp_p : T_pM → M takes tangent vectors to manifold points; exp_p(v) = q implies that d_M(p, q) = ||v|| along a minimizing geodesic. The logarithmic map log_p : M → T_pM takes manifold points to tangent vectors, and exp_p(v) = q is equivalent to log_p(q) = v. These two maps enable switching between manifold points and tangent-vector representations.

Invariant. Our first contribution is a simple invariant that enables us to recover the error parameters. Note that we cannot rely on the finite metric-space technique, since the manifolds we consider have an infinite number of points. Nor do we need an embedding: we have a continuous representation as-is. Instead, we propose a simple idea based on the law of cosines. Essentially, on average, the geodesic triangle formed by the latent variable y ∈ M and two observed LFs λ_a, λ_b is a right triangle. This means it can be characterized by the (Riemannian) version of the Pythagorean theorem:

Lemma 1. For Y = M a hyperbolic manifold, y ∼ P for some distribution P on M, and labeling functions λ_a, λ_b drawn from (3),

$$
E[\cosh d_{\mathcal{Y}}(\lambda_a, \lambda_b)] = E[\cosh d_{\mathcal{Y}}(\lambda_a, y)]\, E[\cosh d_{\mathcal{Y}}(\lambda_b, y)],
$$

while for Y = M a spherical manifold,

$$
E[\cos d_{\mathcal{Y}}(\lambda_a, \lambda_b)] = E[\cos d_{\mathcal{Y}}(\lambda_a, y)]\, E[\cos d_{\mathcal{Y}}(\lambda_b, y)].
$$

These invariants enable us to learn easily by forming a triplet system. Suppose we construct the equation in Lemma 1 for the three pairs among three labeling functions. The resulting system can be solved to express E[cosh d_Y(λ_a, y)] in terms of E[cosh d_Y(λ_a, λ_b)], E[cosh d_Y(λ_a, λ_c)], and E[cosh d_Y(λ_b, λ_c)]. Specifically,

$$
E[\cosh d_{\mathcal{Y}}(\lambda_a, y)] = \sqrt{\frac{E[\cosh d_{\mathcal{Y}}(\lambda_a, \lambda_b)]\, E[\cosh d_{\mathcal{Y}}(\lambda_a, \lambda_c)]}{E[\cosh d_{\mathcal{Y}}(\lambda_b, \lambda_c)]}}.
$$

Note that we can estimate Ê via the empirical versions of the terms on the right, as these are based on observable quantities. This is a generalization of the binary case in [FCS+20] and the Gaussian (Euclidean) case in [SLV+22] to hyperbolic manifolds. A similar estimator can be obtained for spherical manifolds by replacing cosh with cos.
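A sketch of the resulting triplet estimator on the hyperboloid (Lorentz) model of hyperbolic space is below. The hyperboloid model and its closed-form distance, d(u, v) = arccosh(-⟨u, v⟩_L) with ⟨·,·⟩_L the Minkowski inner product, are standard facts used here for illustration; the function names and setup are ours, not the paper's.

```python
# Sketch of the triplet estimator from Lemma 1 on the hyperboloid model of
# hyperbolic space. Function names and setup are illustrative assumptions.
import numpy as np

def minkowski_inner(u, v):
    return -u[..., 0] * v[..., 0] + np.sum(u[..., 1:] * v[..., 1:], axis=-1)

def hyperbolic_dist(u, v):
    return np.arccosh(np.clip(-minkowski_inner(u, v), 1.0, None))

def triplet_estimate(La, Lb, Lc):
    """Estimate E[cosh d(lambda_a, y)] from LF outputs alone, via Lemma 1:
    E[cosh d(a, y)] = sqrt(E[cosh d(a,b)] * E[cosh d(a,c)] / E[cosh d(b,c)])."""
    e_ab = np.mean(np.cosh(hyperbolic_dist(La, Lb)))
    e_ac = np.mean(np.cosh(hyperbolic_dist(La, Lc)))
    e_bc = np.mean(np.cosh(hyperbolic_dist(Lb, Lc)))
    return np.sqrt(e_ab * e_ac / e_bc)

# La, Lb, Lc would be (n, d+1) arrays of LF outputs on the hyperboloid
# (points with <x, x>_L = -1, x_0 > 0); here a single trivial point as a smoke test.
origin = np.array([[1.0, 0.0, 0.0]])
print(triplet_estimate(origin, origin, origin))  # 1.0: all distances are zero
```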
Using this tool, we can obtain a consistent estimator of θ_a for each a = 1, ..., m. Let C_0 satisfy E|Ê[cosh d_Y(λ_a, λ_b)] - E[cosh d_Y(λ_a, λ_b)]| ≥ C_0 E|Ê[d_Y^2(λ_a, λ_b)] - E[d_Y^2(λ_a, λ_b)]|; that is, C_0 reflects the preservation of concentration when moving from cosh(d) to d^2. Then,

Theorem 3. Let M be a hyperbolic manifold. Fix 0 < δ < 1 and let Δ(δ) = min{ρ : Pr(∀i, d_Y(λ_{a,i}, λ_{b,i}) ≤ ρ) ≥ 1 - δ}. Then there exists a constant C_1 so that, with probability at least 1 - δ,

$$
E\big|\hat{E}[d_{\mathcal{Y}}^2(\lambda_a, y)] - E[d_{\mathcal{Y}}^2(\lambda_a, y)]\big| \le \frac{C_1 \cosh(\Delta(\delta))^{3/2}}{C_0 \sqrt{n}}.
$$

As we hoped, our estimator is consistent. Note that we pay a price for a tighter bound: Δ(δ) is large for smaller probability δ. It is possible to estimate the size of Δ(δ) (more generally, it is a function of the curvature). In addition, it is possible to replace the Δ(δ) term by applying a version of McDiarmid's inequality for unbounded spaces, as in [Kon14].

Next, we adapt the downstream model predictor (2) in the following way. Let μ̂_a^2 = Ê[d_Y^2(λ_a, y)]. Let β = [β_1, ..., β_m]^T be such that Σ_a β_a = 1 and β minimizes Σ_a β_a^2 μ̂_a^2. Then, we set

$$
\tilde{f}(x) = \operatorname*{arg\,min}_{y \in \mathcal{Y}} \frac{1}{n} \sum_{i=1}^{n} \alpha_i(x) \sum_{a=1}^{m} \beta_a^2\, d_{\mathcal{Y}}^2(y, \lambda_{a,i}).
$$

We simply replace each of the true labels with a combination of the labeling functions. With this, we can state our final result. First, we introduce our assumptions. Let q = arg min_{z ∈ Y} E[α(x)(y) d_Y^2(z, y)], where the expectation is taken over the population-level distribution and α(x)(y) denotes the kernel at y.

Assumption 7. (Bounded hugging function, c.f. [Str20]) Let q be defined as above. For all a, b ∈ M, the hugging function at q is given by k_q^b(a) = 1 - (||log_q(a) - log_q(b)||^2 - d_Y^2(a, b)) / d_Y^2(q, b). We assume that k_q^b(a) is lower bounded by k_min.

Assumption 8. (Kernel symmetry) We assume that for all x and all v ∈ T_qM, α(x)(exp_q(v)) = α(x)(exp_q(-v)).

The first condition provides control on how geodesic triangles behave; it relates to the curvature. We provide more details on this in the Appendix. The second assumption restricts us to kernels symmetric about the minimizers of the objective F. Finally, suppose we draw (x, y) and (x', y') independently from P_XY, and set σ_o^2 = E[α(x)(y) d_Y^2(y, y')].

Figure 2: Finite metric space case. Parameter estimation improves with the number of samples n in learning to rank, showing nearly-consistent behavior; our tensor decomposition estimator outperforms [SLV+22]. In particular, (top left) as the number of samples increases, our estimates of the positive and negative components of T improve. (Top right) The improvement in T recovery with more samples translates to significantly improved performance over [SLV+22], which is close to constant across n. (Bottom) This improved parameter estimation further translates into improvements in label model accuracy (using only the noisy estimates for prediction, without training an end model) and end model generalization. For the top two plots we use θ = [6, 3, 8], and in the bottom plot we use θ = [0, 0, 1]. In all plots, we report medians along with upper and lower quartiles across 10 trials.

Theorem 4. Let M be a complete manifold and suppose the assumptions above hold. Then, there exist constants C_3, C_4 such that

$$
E[d_{\mathcal{Y}}^2(\tilde{f}(x), \hat{f}(x))] \le \frac{C_3 \sigma_o^2 + C_4 \sum_{a=1}^{m} \beta_a^2 (\hat{\mu}_a^2 + \sigma_o^2)}{n (1 - k_{\min})^2}.
$$

Note that as n grows, as long as our worst-quality LF has bounded variance, our estimator of the true predictor is consistent. Moreover, we also have favorable dependence on the noise rate. This is because the only error we incur is in computing suboptimal β coefficients. We comment on this suboptimality in the Appendix.
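The weights β above have a simple closed form: minimizing Σ_a β_a^2 μ̂_a^2 subject to Σ_a β_a = 1 gives β_a proportional to 1/μ̂_a^2 (a standard Lagrange-multiplier calculation). The sketch below (illustrative only, with hypothetical inputs) computes β from the estimated mean parameters and forms the reweighted objective of the predictor f̃ on a flat toy space.

```python
# Sketch of the reweighted predictor f~ from Section 5 (illustrative; inputs are
# hypothetical). beta solves min sum_a beta_a^2 * mu_hat_a^2 s.t. sum_a beta_a = 1,
# whose closed form is beta_a proportional to 1 / mu_hat_a^2.
import numpy as np

def lf_weights(mu_hat_sq):
    inv = 1.0 / np.asarray(mu_hat_sq, dtype=float)
    return inv / inv.sum()

def predict(candidates, lf_outputs, alpha, beta, dist):
    """f~(x) = argmin_y (1/n) sum_i alpha_i(x) sum_a beta_a^2 d_Y(y, lambda_{a,i})^2."""
    n = len(lf_outputs)
    def objective(y):
        return sum(alpha[i] * sum(b ** 2 * dist(y, lam) ** 2
                                  for b, lam in zip(beta, lf_outputs[i]))
                   for i in range(n)) / n
    return min(candidates, key=objective)

mu_hat_sq = [0.2, 0.5, 2.0]               # estimated E[d^2(lambda_a, y)] per LF
beta = lf_weights(mu_hat_sq)              # most weight on the most reliable LF
# Toy flat (zero-curvature) example on the real line with d(u, v) = |u - v|.
candidates = np.linspace(-1.0, 1.0, 41)
lf_outputs = [[0.1, 0.3, -0.8], [0.0, 0.2, 0.9]]  # n = 2 points, m = 3 LFs each
alpha = [0.5, 0.5]                                 # placeholder kernel weights
print(beta.round(3), predict(candidates, lf_outputs, alpha, beta,
                             dist=lambda u, v: abs(u - v)))
```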
A simple corollary of Theorem 4 provides the generalization guarantees we sought.

Corollary 1. Let M be a complete manifold and suppose the assumptions above hold. Then, with high probability, R(f̃) ≤ R(f*) + O(n^{-1/4}).

Extensions to Other Manifolds. First, we note that all of our approaches almost immediately lift to products of constant-curvature spaces. For example, M_1 × M_2 has squared metric d_Y^2(p, q) = d_{M_1}^2(p_1, q_1) + d_{M_2}^2(p_2, q_2), where p_i, q_i are the projections of p, q onto the i-th component. We can go beyond products of constant-curvature spaces as well. To do so, we can build generalizations of the law of cosines (as needed for the invariance in Lemma 1). For example, it is possible to do so for symmetric Riemannian manifolds using the tools in [AH91].

6 Experiments

Finally, we validate our theoretical claims with experimental results demonstrating improved parameter recovery and end model generalization using our techniques over those of prior work [SLV+22]. We illustrate both the finite metric space and continuous space cases by targeting rankings (i.e., permutations) and hyperbolic spaces. In the case of rankings, we show that our pseudo-Euclidean embeddings with the tensor decomposition estimator yield stronger parameter recovery and downstream generalization than [SLV+22]. In the case of hyperbolic regression (an example of a Riemannian manifold), we show that our estimator yields improved parameter recovery over [SLV+22].

Finite metric spaces: Learning to rank. To experimentally evaluate our tensor decomposition estimator for finite metric spaces, we consider the problem of learning to rank. We construct a synthetic dataset whose ground truth comprises n samples of two distinct rankings among the finite metric space of all length-four permutations. We construct three labeling functions by sampling rankings according to a Mallows model, for which we obtain pseudo-Euclidean embeddings to use with our tensor decomposition estimator. In Figure 2 (top left), we show that as we increase the number of samples, we obtain an increasingly accurate estimate of T, exhibiting the nearly-consistent behavior predicted by our theoretical claims. This leads to downstream improvements in parameter estimates, which also become more accurate as n increases. In contrast, we find that the estimates of the same parameters given by [SLV+22] do not improve substantially as n increases, and are ultimately worse (see Figure 2, top right). Finally, this leads to improvements in label model accuracy as compared to that of [SLV+22], and translates to improved accuracy of an end model trained using synthetic samples (see Figure 2, bottom).

Figure 3: Continuous case (absolute error on LF 1 versus number of samples). Parameter estimation improves with more samples in the hyperbolic regression problem. Our estimator outperforms [SLV+22]. Here, we use different randomly sampled values of θ for each run. We report medians along with upper and lower quartiles across 10 trials.

Riemannian manifolds: Hyperbolic regression. We similarly evaluate our estimator using synthetic labels from a hyperbolic manifold, matching the setting of Section 5. As shown in Figure 3, we find that our estimator consistently outperforms that of [SLV+22], often by an order of magnitude.
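As an illustration of the learning-to-rank setup above, the following sketch (our own toy reconstruction, not the released experiment code) samples labeling-function votes from a Mallows model over length-four permutations by exponentially downweighting rankings far from the true one.

```python
# Toy reconstruction (not the released experiment code) of the learning-to-rank
# setup in Section 6: LF votes are drawn from a Mallows model over S_4, i.e.,
# P(pi | y) proportional to exp(-theta_a * d(pi, y)) with d the Kendall tau distance.
import numpy as np
from itertools import combinations, permutations

def kendall_tau(p, q):
    pos_p = {v: i for i, v in enumerate(p)}
    pos_q = {v: i for i, v in enumerate(q)}
    return sum((pos_p[a] - pos_p[b]) * (pos_q[a] - pos_q[b]) < 0
               for a, b in combinations(p, 2))

def sample_mallows(y_true, theta_a, rng, space):
    logits = np.array([-theta_a * kendall_tau(pi, y_true) for pi in space])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return space[rng.choice(len(space), p=probs)]

rng = np.random.default_rng(0)
space = list(permutations(range(4)))
y_true = (0, 1, 2, 3)
thetas = [6.0, 3.0, 8.0]          # per-LF reliabilities (cf. Figure 2's setting)
votes = [sample_mallows(y_true, t, rng, space) for t in thetas]
print(votes)
```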
7 Conclusion

We studied the theoretical properties of weak supervision applied to structured prediction in two general scenarios: label spaces that are finite metric spaces or constant-curvature manifolds. We introduced ways to estimate the noise rates of labeling functions, achieving consistency or near-consistency. Using these tools, we established that, with suitable modifications, downstream structured prediction models maintain generalization guarantees. Future directions include extending these results to even more general manifolds and removing some of the assumptions needed in our analysis.

Acknowledgments

We are grateful for the support of the NSF (CCF2106707), the American Family Funding Initiative, and the Wisconsin Alumni Research Foundation (WARF). We are thankful to Changho Shin and Harshavardhan Adepu for the discussions and feedback.

References

[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773-2832, 2014.

[AGJ14] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Sample complexity analysis for learning overcomplete latent variable models through tensor methods. arXiv preprint arXiv:1408.0553, 2014.

[AH91] Helmer Aslaksen and Hsueh-Ling Huynh. Laws of trigonometry in symmetric spaces. Geometry from the Pacific Rim, 1991.

[BHS+07] Gökhan H. Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan. Predicting Structured Data (Neural Information Processing). The MIT Press, 2007.

[CRR16] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems 30 (NIPS 2016), volume 30, 2016.

[DRS+20] Jared A. Dunnmon, Alexander J. Ratner, Khaled Saab, Nishith Khandwala, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew P. Lungren, Daniel L. Rubin, and Christopher Ré. Cross-modal data programming enables rapid medical machine learning. Patterns, 1(2), 2020.

[FCS+20] Daniel Y. Fu, Mayee F. Chen, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, and Christopher Ré. Fast and three-rious: Speeding up weak supervision with triplet methods. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020.

[Gol85] Lev Goldfarb. A new approach to pattern recognition. pages 241-402, 1985.

[GS19] Colin Graber and Alexander Schwing. Graph structured prediction energy networks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2019), volume 33, 2019.

[KAG18] Anna Korba, Alexandre Garcia, and Florence d'Alché-Buc. A structured prediction approach for label ranking. In Advances in Neural Information Processing Systems 32 (NeurIPS 2018), volume 32, 2018.

[KL15] Volodymyr Kuleshov and Percy S. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015.

[Kon14] Aryeh Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 28-36, Beijing, China, 22-24 Jun 2014. PMLR.

[KW78] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, 1978.

[Lee00] John M. Lee. Introduction to Smooth Manifolds. Springer, 2000.

[LRBM06] Julian Laub, Volker Roth, Joachim M. Buhmann, and Klaus-Robert Müller. On the information and representation of non-Euclidean pairwise data. Pattern Recognition, 39(10):1815-1826, 2006.

[NDRT13] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS'13, pages 1196-1204, 2013.
[PHD+06] Elżbieta Pękalska, Artsiom Harol, Robert P. W. Duin, Barbara Spillmann, and Horst Bunke. Non-Euclidean or non-metric measures can be informative. In Dit-Yan Yeung, James T. Kwok, Ana Fred, Fabio Roli, and Dick de Ridder, editors, Structural, Syntactic, and Statistical Pattern Recognition, pages 871-880, 2006.

[PM19] Alexander Petersen and Hans-Georg Müller. Fréchet regression for random objects with Euclidean predictors. Annals of Statistics, 47(2):691-719, 2019.

[PPD01] Elżbieta Pękalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175-211, 2001.

[RBE+18] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the 44th International Conference on Very Large Data Bases (VLDB), Rio de Janeiro, Brazil, 2018.

[RCMR18] Alessandro Rudi, Carlo Ciliberto, Gian Maria Marconi, and Lorenzo Rosasco. Manifold structured prediction. In Advances in Neural Information Processing Systems 32 (NeurIPS 2018), volume 32, 2018.

[RHD+19] A. J. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré. Training complex models with multi-task weak supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, 2019.

[RNGS20] Christopher Ré, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. Overton: A data system for monitoring and improving machine-learned products. In Proceedings of the 10th Annual Conference on Innovative Data Systems Research, 2020.

[RSW+16] A. J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 2016.

[SLB20] Esteban Safranchik, Shiying Luo, and Stephen Bach. Weakly supervised sequence tagging from noisy rules. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 5570-5578, Apr. 2020.

[SLV+22] Changho Shin, Winfred Li, Harit Vishwakarma, Nicholas Carl Roberts, and Frederic Sala. Universalizing weak supervision. In International Conference on Learning Representations, 2022.

[Str20] Austin J. Stromme. Wasserstein Barycenters: Statistics and Optimization. MIT, 2020.

[Tu11] Loring W. Tu. An Introduction to Manifolds. Springer, 2011.

[vRW18] Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18(228):1-50, 2018.

Checklist

1. Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
2. Did you describe the limitations of your work? [Yes]
3. Did you discuss any potential negative societal impacts of your work? [N/A]
4. Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
5. Did you state the full set of assumptions of all theoretical results? [Yes]
6. Did you include complete proofs of all theoretical results? [Yes] See appendix.