Published as a conference paper at ICLR 2022

CREATING TRAINING SETS VIA WEAK INDIRECT SUPERVISION

Jieyu Zhang1,2, Bohan Wang1,3, Xiangchen Song4, Yujing Wang1, Yaming Yang1, Jing Bai1, Alexander Ratner2,5
1Microsoft Research Asia  2University of Washington  3University of Science and Technology of China  4Carnegie Mellon University  5Snorkel AI, Inc.
{jieyuz2, ajratner}@cs.washington.edu  {yujwang, yayaming, jbai}@microsoft.com  wbhfy@mail.ustc.edu.cn  xiangchensong@cmu.edu

ABSTRACT

Creating labeled training sets has become one of the major roadblocks in machine learning. To address this, recent Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources. However, existing frameworks are restricted to supervision sources that share the same output space as the target task. To extend the scope of usable sources, we formulate Weak Indirect Supervision (WIS), a new research problem for automatically synthesizing training labels based on indirect supervision sources that have different output label spaces. To overcome the challenge of mismatched output spaces, we develop a probabilistic modeling approach, PLRM, which uses user-provided label relations to model and leverage indirect supervision sources. Moreover, we provide a theoretically-principled test of the distinguishability of PLRM for unseen labels, along with a generalization bound. On both image and text classification tasks as well as an industrial advertising application, we demonstrate the advantages of PLRM by outperforming baselines by a margin of 2%-9%.

1 INTRODUCTION

One of the greatest bottlenecks of using modern machine learning models is the need for substantial amounts of manually-labeled training data. In real-world applications, such manual annotations are typically time-consuming, labor-intensive, and static. To reduce the effort of annotation, researchers have proposed Weak Supervision (WS) frameworks (Ratner et al., 2016; 2018; 2019; Fu et al., 2020) for synthesizing labels from multiple weak supervision sources, e.g., heuristics, knowledge bases, or pre-trained classifiers. These frameworks have been widely applied to various machine learning tasks (Dunnmon et al., 2020; Fries et al., 2021; Safranchik et al., 2020; Lison et al., 2020; Zhou et al., 2020; Hooper et al., 2021; Zhan et al., 2019; Varma et al., 2019) and industrial data (Bach et al., 2019). Among them, data programming (Ratner et al., 2016), a representative example that generalizes many approaches in the literature, represents weak supervision sources as labeling functions (LFs) and synthesizes training labels using a Probabilistic Graphical Model (PGM).

Given both the increasing popularity of WS and the general increase in open-source availability of machine learning models and tools, there is a rising tide of available supervision sources that WS frameworks and practitioners could potentially leverage, including pre-trained machine learning models or prediction APIs (Chen et al., 2020; d'Andrea & Mintz, 2019; Yao et al., 2017). However, existing WS frameworks only utilize weak supervision sources with the same label space as the target task. This incompatibility largely limits the scope of usable sources, necessitating manual effort from domain experts to provide supervision for unseen labels.
For example, consider a target task of classifying {dog, wolf, cat, lion} and a set of three weak supervision sources (e.g., trained classifiers or expert heuristics) with disjoint output spaces {caninae, felidae}, {domestic animals, wild animals}, and {husky, bengal cat}, respectively. We call these types of sources indirect supervision sources. For concreteness, we follow the general convention of data programming (Ratner et al., 2016) and refer to these sources as indirect labeling functions (ILFs).

Despite their apparent utility, existing weak supervision methods cannot directly leverage such ILFs, as their output spaces have no overlap with the target one. In this paper, we formulate a novel research problem that aims to leverage such ILFs automatically, minimizing the manual effort to develop and deploy new models. We refer to this as the Weak Indirect Supervision (WIS) setting, a new weak supervision paradigm which leverages ILFs, along with the relational structures between individual labels, to automatically create training labels.

The key difficulty of leveraging ILFs is the mismatched label spaces. To overcome this, we introduce pairwise relations between individual labels to the WIS setup. Such relations are often available in structured sources (e.g., off-the-shelf knowledge bases (Miller, 1995; Sinha et al., 2015; Dong et al., 2020) or large-scale label hierarchies (Murty et al., 2017; The Gene Ontology Consortium, 2018; Partalas et al., 2015) for various domains), or can be provided by subject matter experts in far less time than generating entirely new sets of weak supervision sources. In the aforementioned example, we could rely on a biological species ontology to see that the unseen labels dog and cat are both subsumed by the seen label domestic animals. Based on the label relations, we can automatically leverage the supervision sources as ILFs. Notably, previous work (Qu et al., 2020) also leveraged a label relation graph but focused on a relation extraction task in a few-shot learning setting, while You et al. (2020) proposed to learn label relations given data for each label in a transfer learning scenario. In contrast, we aim to solve the target task directly and without clean labeled data.

The remaining questions are: (1) how do we synthesize labels based on pairwise label relations and ILFs? and (2) how can we know whether, given a set of ILFs and label relations, the unseen labels are distinguishable or not? To address the first question, we develop a probabilistic label relation model (PLRM), the first PGM for WIS, which aggregates the outputs of ILFs and models the label relations as dependencies between random variables. In turn, we use the learned PLRM to produce labels for training an end model. Furthermore, we derive the generalization error bound of PLRM based on assumptions similar to previous work (Ratner et al., 2016). The second question presents an important stumbling block when dealing with unseen labels, as we may not be able to distinguish the unseen labels given existing label relations and ILFs, resulting in an unsatisfactory synthesized training set. To address this issue, we formally introduce the notion of distinguishability in the WIS setting and theoretically establish an equivalence between (1) the distinguishability of the label relation structure together with the ILFs, and (2) the capability of PLRM to distinguish unseen labels. This result then leads to a simple sanity test for preventing the model from failing to distinguish unseen labels.
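To ground the setup before moving on, the sketch below walks through the running example in plain Python. Everything in it (the feature dictionary, the three toy ILFs, and the relation table) is illustrative rather than taken from our implementation; it only shows how label relations bridge the mismatched label spaces that block standard WS aggregation.

```python
# A self-contained sketch of the running example; every name here is illustrative.

TARGET_LABELS = ["dog", "wolf", "cat", "lion"]   # unseen / desired labels Y

# Three indirect labeling functions (ILFs) with disjoint, non-target output spaces.
def ilf_family(x):        # e.g. a pretrained taxonomic classifier
    return "caninae" if x["snout"] == "long" else "felidae"

def ilf_habitat(x):       # e.g. an expert heuristic
    return "domestic animals" if x["lives_with_humans"] else "wild animals"

def ilf_breed(x):         # e.g. a fine-grained breed classifier
    return "husky" if x.get("breed") == "husky" else "bengal cat"

# None of the outputs above is a target label, so standard WS cannot aggregate
# them.  Label relations (e.g. from a species ontology) bridge the two spaces:
# for each seen label, list the unseen labels it is NOT exclusive with.
NON_EXCLUSIVE_UNSEEN = {
    "caninae": ["dog", "wolf"],
    "felidae": ["cat", "lion"],
    "domestic animals": ["dog", "cat"],   # dog and cat are subsumed by it
    "wild animals": ["wolf", "lion"],
    "husky": ["dog"],                     # husky is subsumed by dog
    "bengal cat": ["cat"],
}

x = {"snout": "long", "lives_with_humans": True, "breed": "husky"}
votes = [ilf_family(x), ilf_habitat(x), ilf_breed(x)]

# Intersecting the compatible unseen labels of each vote already narrows the
# answer to "dog" in this noise-free toy case.
candidates = set(TARGET_LABELS)
for v in votes:
    candidates &= set(NON_EXCLUSIVE_UNSEEN[v])
print(votes, candidates)   # ['caninae', 'domestic animals', 'husky'] {'dog'}
```

In practice the bridging is done by the probabilistic model of Section 5 rather than by set intersection, which would be brittle under noisy ILFs.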
In preliminary experiments, we observe a significant drop in model performance when this condition is violated. In our experiments, we make non-trivial adaptations of baselines from related settings to the new WIS problem. On both text and image classification tasks, we demonstrate the advantages of PLRM over the adapted baselines. Finally, in a commercial advertising system where developers need to collect annotations for new ads tags, we illustrate how to formulate the training label collection as a WIS problem and apply PLRM to achieve effective performance.

Summary of Contributions. Our contributions are summarized as follows:
- We formulate Weak Indirect Supervision (WIS), a new research problem which synthesizes training labels based on indirect supervision sources and label relations, minimizing the human effort of both data annotation and weak supervision source construction;
- We develop the first model for WIS, the Probabilistic Label Relation Model (PLRM), with statistical efficiency comparable to previous WS frameworks and standard supervised learning;
- We introduce a new notion of distinguishability in the WIS setting, and provide a simple test of the distinguishability of PLRM for unseen labels by theoretically establishing the connection between label relation structures and distinguishability;
- We showcase the potential of the WIS formulation and the effectiveness of PLRM in a commercial advertising system for synthesizing training labels of new ads tags. On academic image and text classification tasks, we demonstrate the advantages of PLRM over baselines through quantitative experiments. Overall, PLRM outperforms baselines by a margin of 2%-9%.

2 RELATED WORK

Table 1: Comparison between the proposed weak indirect supervision (WIS) and related machine learning tasks. Compared to normal and weakly supervised learning, WIS handles mismatched train and test label spaces. WIS is similar in spirit to indirect supervision (IS) and zero-shot learning (ZSL), but distinct in that WIS only takes as input weak or noisy labels and a simple set of logical label relations, and aims to output a training data set rather than a trained model, affording complete modularity in which final model class is used.

| Task | Label Type | $\mathcal{Y}_{train} = \mathcal{Y}_{test}$ | Label Information | When Label Info. is Required |
| --- | --- | --- | --- | --- |
| Supervised Learning (SL) | Clean Labels | Yes | - | - |
| Weak Supervision (WS) | Noisy Sources | Yes | - | - |
| Indirect Supervision (IS) | Clean Labels | No | Label Trans. Matrix | Training |
| Zero-Shot Learning (ZSL) | Clean Labels | No | Label Embed. / Attribute | Training & Test |
| Weak Indirect Supervision (WIS) | Noisy Sources | No | Label Relation | Training |

We briefly review related settings. The comparison between WIS and related tasks is in Table 1.

Weak Supervision: We draw motivation from recent work which models and integrates weak supervision sources using PGMs (Ratner et al., 2016; 2018; 2019; Fu et al., 2020) and other methods (Guan et al., 2018; Khetan et al., 2018) to create training sets. While these works assume supervision sources share the same label space as the new task, we aim to leverage indirect supervision sources with mismatched label spaces in a labor-free way.

Indirect Supervision: Indirect supervision arises more generally in latent-variable models for various domains (Brown et al., 1993; Liang et al., 2013; Quattoni et al., 2004; Chang et al., 2010; Zhang et al., 2019).
Very recently, Raghunathan et al. (2016) proposed to use the linear moment method for indirect supervision, wherein the transition between the desired label space $\mathcal{Y}$ and the indirect supervision space $\mathcal{O}$ is known, as is the ground truth of the indirect supervision for training. In contrast, both are unavailable in WIS. Theoretically, Wang et al. (2020) developed a unified framework for analyzing the learnability of indirect supervision with shared or superset label spaces, while we focus on disjoint label spaces and the consequent unique challenge of distinguishability of unseen classes.

Zero-Shot Learning: Zero-Shot Learning (ZSL) (Lampert et al., 2009; Wang et al., 2019) aims to learn a classifier that is able to generalize to unseen classes. The WIS problem differs from ZSL in that (1) in the ZSL setting, the training and test data belong to seen and unseen classes, respectively, and the training data is labeled, while in WIS both training and test data belong to unseen classes and are unlabeled; and (2) ZSL aims to produce a classifier that can predict unseen classes given certain label information, e.g., label attributes (Romera-Paredes & Torr, 2015), label descriptions (Srivastava et al., 2018) or label similarities (Frome et al., 2013), while WIS aims to provide training labels for unlabeled training data, allowing users to train any machine learning model; the label relations are used only in synthesizing training labels.

3 PRELIMINARY: WEAK SUPERVISION

We first describe the Weak Supervision (WS) setting. A glossary of the notation used is in App. A.

Definitions and notations. We assume a $k$-way classification task and an unlabeled dataset $D$ consisting of $m$ data points. Denote by $X_i \in \mathcal{X}$ an individual data point and by $Y_i \in \mathcal{Y} = \{y_1, \ldots, y_k\}$ the unobserved label of interest for $X_i$. We also have $n$ sources, each represented by a labeling function (LF) and denoted by $\lambda_j$. Each $\lambda_j$ outputs a label $\hat{Y}_i^j \in \mathcal{Y}_{\lambda_j} = \{\hat{y}^j_1, \ldots, \hat{y}^j_{k_{\lambda_j}}\}$ on $X_i$, where $\mathcal{Y}_{\lambda_j}$ is the label space associated with $\lambda_j$ and $|\mathcal{Y}_{\lambda_j}| = k_{\lambda_j}$. We denote the concatenation of the LF outputs as $\hat{Y}_i = [\hat{Y}_i^1, \hat{Y}_i^2, \ldots, \hat{Y}_i^n]$, and the union of the LF label spaces as $\hat{\mathcal{Y}}$ with $|\hat{\mathcal{Y}}| = \hat{k}$. Note that $\hat{k}$ is not necessarily equal to the sum of the $k_{\lambda_j}$, since LFs may have overlapping label spaces. We call $\hat{y} \in \hat{\mathcal{Y}}$ a seen label and $y \in \mathcal{Y}$ a desired label. In the WS setting, we have $\mathcal{Y} \subseteq \hat{\mathcal{Y}}$. Notably, we assume all the involved labels come from the same semantic space.

The goal of WS. The goal is to infer training labels for the dataset $D$ based on the LFs, and to use them to train an end discriminative classifier $f_W : \mathcal{X} \to \mathcal{Y}$, all without ground-truth training labels.
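For readers less familiar with the WS pipeline just described, here is a minimal sketch of it in Python. It stands in for a real WS framework: the label model below is a simple majority vote rather than the learned PGM of data programming, and all LF and data names are made up for illustration.

```python
import numpy as np

# Desired label space Y, shared by every LF in the standard WS setting.
Y = ["dog", "cat"]

# Toy keyword LFs; returning None means the LF abstains on that data point.
def lf_bark(text):  return "dog" if "bark" in text else None
def lf_meow(text):  return "cat" if "meow" in text else None
def lf_fetch(text): return "dog" if "fetch" in text else None

LFS = [lf_bark, lf_meow, lf_fetch]
D = ["dogs bark and fetch", "cats meow", "it likes to fetch"]

# Label matrix: one row per data point, one column per LF.
L = [[lf(x) for lf in LFS] for x in D]

def majority_vote(row):
    """Turn one row of LF votes into a probabilistic label over Y."""
    counts = np.array([sum(v == y for v in row) for y in Y], dtype=float)
    if counts.sum() == 0:                 # all LFs abstained
        return np.ones(len(Y)) / len(Y)
    return counts / counts.sum()

probabilistic_labels = np.array([majority_vote(row) for row in L])
print(probabilistic_labels)   # soft labels used to train the end model f_W
```

The probabilistic labels produced this way are what the end model $f_W$ is trained on; WIS keeps this overall pipeline but must first cross the gap between $\hat{\mathcal{Y}}$ and $\mathcal{Y}$.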
[Figure 1: pipeline diagram showing the unlabeled data, three indirect labeling functions with outputs $\hat{Y}^1, \hat{Y}^2, \hat{Y}^3$ (e.g., Canine, Felidae, Domestic Animals, Wild Animals, Husky, Bengal Cat), the label graph, the label model, the probabilistic labels, and the end model.]

Figure 1: An example of the WIS problem: the input consists of an unlabeled dataset, a label graph, and $n$ indirect labeling functions (ILFs). The ILFs represent weak supervision sources such as pre-trained classifiers, knowledge bases, heuristic rules, etc. The ILFs cannot predict the desired labels, i.e., {dog, wolf, cat, lion}. To address this, a label graph is given; here we only visualize the subsuming relation. Finally, a label model, instantiated as a PGM, takes the ILFs' outputs and produces probabilistic labels in the target output space, which are in turn used to train an end machine learning model that can generalize beyond them.

4 WEAK INDIRECT SUPERVISION

Now, we introduce the new Weak Indirect Supervision (WIS) problem. Unlike the standard WS setting, we only have indirect labeling functions (ILFs) instead of LFs, and an additional label graph is given. The goal of WIS remains the same as WS. An example of a WIS problem is in Fig. 1.

Indirect Labeling Function. In WIS, we only have indirect labeling functions (ILFs), which cannot directly predict any desired labels, i.e., $\hat{\mathcal{Y}} \cap \mathcal{Y} = \emptyset$. Therefore, we refer to the desired labels as unseen labels. To make it possible to leverage the ILFs, a label graph is given, which encodes pairwise label relations between the different seen and unseen labels.

Label Graph. Concretely, a label graph $G = (V, E)$ consists of (1) a set of all the labels as nodes, i.e., $V = \hat{\mathcal{Y}} \cup \mathcal{Y}$, and (2) a set of pairwise label relations as typed edges, i.e., $E = \{(y_i, y_j, t_{y_i y_j}) \mid t_{y_i y_j} \in \mathcal{T}, i < j, y_i, y_j \in V\}$. Here, $\mathcal{T}$ is the set of label relation types and, similar to Deng et al. (2014), there are four types of label relations: exclusive, overlapping, subsuming, and subsumed, denoted by $t_e$, $t_o$, $t_{sg}$, and $t_{sd}$, respectively. Notably, for any ordered pair of labels $(y_i, y_j)$, their label relation falls into exactly one of the four types. The rationale behind these label relations is that, when treating each label as a set, there are four distinct set relations, each of which corresponds to one of the defined label relations, as shown in Fig. 2. For convenience, we denote the set of non-exclusive neighbors of a given label $y$ in $\hat{\mathcal{Y}}$ as $N(y, \hat{\mathcal{Y}})$, i.e., $N(y, \hat{\mathcal{Y}}) = \{\hat{y} \in \hat{\mathcal{Y}} \mid t_{y\hat{y}} \neq t_e\}$.

Figure 2: The one-to-one mapping between label relations and set relations: exclusive ($A \cap B = \emptyset$), overlapping ($A \cap B \neq \emptyset$, $A \not\subseteq B$, $B \not\subseteq A$), subsuming ($A \supset B$), and subsumed ($A \subset B$).
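As a concrete companion to the definitions above, the following sketch stores a small label graph as a dictionary of typed ordered pairs and implements the non-exclusive neighborhood $N(y, \cdot)$. The specific labels and relations are illustrative, not a prescribed data structure from our system.

```python
# A minimal sketch of the label graph G = (V, E) from Section 4, stored as a
# dictionary of typed, ordered label pairs.  Labels and relations are illustrative.

EXCLUSIVE, OVERLAP, SUBSUMING, SUBSUMED = "e", "o", "sg", "sd"   # relation types T

UNSEEN = ["dog", "cat"]
SEEN = ["domestic animals", "husky", "felidae"]

# E: relation type t_{y, y'} for ordered pairs (read "y is <t> relative to y'").
RELATIONS = {
    ("dog", "domestic animals"): SUBSUMED,   # dog  is a subset of domestic animals
    ("dog", "husky"): SUBSUMING,             # dog  is a superset of husky
    ("dog", "felidae"): EXCLUSIVE,
    ("cat", "domestic animals"): SUBSUMED,
    ("cat", "husky"): EXCLUSIVE,
    ("cat", "felidae"): SUBSUMED,
}

def relation(a, b):
    """Look up t_{ab}; flip subsuming/subsumed when the pair is stored reversed."""
    if (a, b) in RELATIONS:
        return RELATIONS[(a, b)]
    t = RELATIONS[(b, a)]
    return {SUBSUMING: SUBSUMED, SUBSUMED: SUBSUMING}.get(t, t)

def non_exclusive_neighbors(y, candidates):
    """N(y, candidates): all labels in `candidates` not exclusive to y."""
    return [c for c in candidates if relation(y, c) != EXCLUSIVE]

print(non_exclusive_neighbors("dog", SEEN))   # ['domestic animals', 'husky']
print(non_exclusive_neighbors("cat", SEEN))   # ['domestic animals', 'felidae']
```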
5 PROBABILISTIC LABEL RELATION MODEL

One of the key difficulties in both WS and WIS is that we do not observe the true label $Y_i$. Following prior work (Ratner et al., 2016; 2019; Fu et al., 2020), we use a latent-variable Probabilistic Graphical Model (PGM) for estimating $Y_i$ based on the $\hat{Y}_i$ output by the ILFs. Specifically, the PGM is instantiated as a factor graph model. This standard technique lets us describe the family of generative distributions in terms of $M$ known dependencies/factor functions $\{\phi\}$ and an unknown parameter $\Theta \in \mathbb{R}^M$ as $P_\Theta(\cdot) \propto \exp(\Theta^\top \Phi(\cdot))$, where $\Phi$ is the concatenation of $\{\phi\}$. However, the unique challenge for WIS is that the dependencies $\{\phi\}$ between $Y_i$ and $\hat{Y}_i$ are unknown due to the mismatched label spaces. We overcome this by leveraging the label graph $G$ to build the dependencies for the PGM.

5.1 A BASELINE PGM FOR WIS

In prior work (Ratner et al., 2016; Bach et al., 2017), the PGM for WS is governed by accuracy dependencies:
$$\phi^{Acc}_{y,j}(Y, \hat{Y}^j) := \mathbb{1}\{Y = \hat{Y}^j = y\},$$
which is defined for each $\lambda_j$ and $y \in \mathcal{Y}_{\lambda_j} \cap \mathcal{Y}$. However, in WIS, the ILFs cannot predict any desired label $y \in \mathcal{Y}$. As a simple baseline approach, we leverage the coarse-grained exclusive/non-exclusive label relation to build a corresponding "accuracy" factor. Specifically, for an ILF $\lambda_j$ and a label $\hat{y} \in \mathcal{Y}_{\lambda_j}$, given a desired label $y \in \mathcal{Y}$, if $\hat{y}$ and $y$ have a non-exclusive label relation, i.e., $\hat{y} \in N(y, \mathcal{Y}_{\lambda_j})$, we expect a certain portion of the data assigned $\hat{y}$ to be labeled as $y$. Thus, we treat $\hat{Y}^j = \hat{y}$ as a pseudo indicator of $Y = y$ and add a pseudo accuracy dependency between them:
$$\phi^{Acc}_{y,\hat{y},j}(Y, \hat{Y}^j) := \mathbb{1}\{Y = y \wedge \hat{Y}^j = \hat{y}\}.$$
We call the PGM governed by pseudo accuracy dependencies Weak Supervision with Label Graph (WS-LG). Notably, it can be treated as a simple adaptation of the PGM for WS (Ratner et al., 2016; 2019; Fu et al., 2020) to the WIS problem. However, such a naïve adaptation has two drawbacks:

1. It does not model dependencies specific to ILFs that output different seen labels. For example, two ILFs outputting Husky and Bulldog respectively would naively be modeled the same as if they both output Dog.
2. It can only directly model exclusive/non-exclusive label relations, ignoring the prior knowledge encoded in the other relation types, i.e., subsuming, subsumed, or overlapping. For example, given an unseen label Dog and some ILFs outputting Husky or Domestic Animals, WS-LG would treat all ILFs as indicators of Dog. However, we know a Husky is of course a Dog (subsumed relation), while a Domestic Animal is not necessarily a Dog (subsuming relation).

5.2 PROBABILISTIC LABEL RELATION MODEL

To more directly model the full range and nuance of label relations, we propose a new probabilistic label relation model (PLRM). In PLRM, we explicitly model both (1) the dependency between ILF outputs and the true labels in their output spaces, i.e., their direct accuracy, and (2) the dependencies between these labels and the target unseen labels, as separate dependency types, thus explicitly incorporating the full label relation graph into our model and learning its corresponding weights.

Concretely, we augment the WS-LG model with (1) latent variables representing the assignment of the data to each seen label, and (2) label relation dependencies which capture fine-grained label relations between these output labels and desired labels. To model the seen labels in $\hat{\mathcal{Y}}$, we introduce a binary latent random vector $\tilde{Y} = [\tilde{Y}_1, \ldots, \tilde{Y}_{\hat{k}}]$, where $\tilde{Y}_i$ indicates whether the data should be assigned $\hat{y}_i$.
Then, for each ILF $\lambda_j$ that can predict $\hat{y}_i$, we have the accuracy dependency:
$$\phi^{Acc}_{\hat{y}_i,j}(\tilde{Y}_i, \hat{Y}^j) := \mathbb{1}\{\tilde{Y}_i = 1 \wedge \hat{Y}^j = \hat{y}_i\}.$$
To model fine-grained label relations, for a desired label $y \in \mathcal{Y}$ and a seen label $\hat{y}_i \in \hat{\mathcal{Y}}$, we add label relation dependencies. We enumerate the label relation dependencies corresponding to the four label relation types, i.e., exclusive, overlapping, subsuming, and subsumed, as follows:
$$\phi^{e}_{y,\hat{y}_i}(Y, \tilde{Y}_i) := -\mathbb{1}\{Y = y \wedge \tilde{Y}_i = 1\}, \qquad \phi^{o}_{y,\hat{y}_i}(Y, \tilde{Y}_i) := \mathbb{1}\{Y = y \wedge \tilde{Y}_i = 1\},$$
$$\phi^{sg}_{y,\hat{y}_i}(Y, \tilde{Y}_i) := \mathbb{1}\{Y = y \wedge \tilde{Y}_i = 1\}, \qquad \phi^{sd}_{y,\hat{y}_i}(Y, \tilde{Y}_i) := \mathbb{1}\{Y = y \wedge \tilde{Y}_i = 0\}.$$

The above dependencies encode the prior knowledge of the label relations, but also allow the model to learn corresponding parameters. For example, an exclusive label relation dependency $\phi^{e}$ outputs -1 when two exclusive labels are activated at the same time for the same data point, and 0 otherwise, which reflects our prior knowledge of the exclusive label relation; the corresponding parameter can be treated as the strength of the label relation. Likewise, for any pair of seen labels, we add a label relation dependency following the same convention. Finally, we specify the model as:
$$P_\Theta(Y, \tilde{Y}, \hat{Y}) \propto \exp\big(\Theta^\top \Phi(Y, \tilde{Y}, \hat{Y})\big). \quad (1)$$
Recall that $Y$ is the unobserved true label, $\tilde{Y}$ is the binary random vector, each of whose binary values $\tilde{Y}_i$ reflects whether the data should be assigned the seen label $\hat{y}_i \in \hat{\mathcal{Y}}$, and $\hat{Y}$ is the concatenated output of the ILFs.

Learning Objective. We estimate the parameters $\hat{\Theta}$ by minimizing the negative log marginal likelihood $P_\Theta(\hat{Y})$ for the observed ILF outputs $\hat{Y}_{1:m}$:
$$\hat{\Theta} = \arg\min_{\Theta} \; -\sum_{i=1}^{m} \log \sum_{Y, \tilde{Y}} P_\Theta(Y, \tilde{Y}, \hat{Y}_i). \quad (2)$$
We follow Ratner et al. (2016) and optimize the objective using stochastic gradient descent.

Training an End Model. Let $p_{\hat{\Theta}}(Y \mid \hat{Y})$ be the probabilistic label (i.e., distribution) predicted by the learned PLRM. We then train an end model $f_W : \mathcal{X} \to \mathcal{Y}$ parameterized by $W$ by minimizing the empirical noise-aware loss (Ratner et al., 2019) with respect to $\hat{\Theta}$ over the $m$ unlabeled data points:
$$\hat{W} = \arg\min_{W} \; \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{Y \sim p_{\hat{\Theta}}(Y \mid \hat{Y}_i)}\big[\ell(Y, f_W(X_i))\big], \quad (3)$$
where $\ell(Y, f_W(X_i))$ is a standard cross-entropy loss.

Generalization Error Bound. We extend previous results from Ratner et al. (2016) to bound both the expected error of the learned parameter $\hat{\Theta}$ and the expected risk of $\hat{W}$. All proof details and descriptions of the assumptions can be found in the Appendix.

Theorem 1. Suppose that we run stochastic gradient descent to produce $\hat{\Theta}$ and $\hat{W}$ based on Eqs. (2) and (3), respectively, and that our setup satisfies certain assumptions (App. D.2). Let $|D|$ be the size of the unlabeled dataset. Then we have
$$\mathbb{E}\big[\|\hat{\Theta} - \Theta^*\|^2\big] \leq O\!\left(\frac{M \log |D|}{|D|}\right), \qquad \mathbb{E}\big[\ell(\hat{W}) - \ell(W^*)\big] \leq \chi + O\!\left(\sqrt{\frac{M \log |D|}{|D|}}\right).$$

Interpreting the Bound. By Theorem 1, the two errors decrease at rates $O(1/|D|)$ and $O(1/|D|^{1/2})$, respectively, as $|D|$ increases. This shows that although we trade computational efficiency for the reduction of human effort by using more complex dependencies and more latent variables, we theoretically maintain statistical efficiency comparable to previous WS frameworks and supervised learning.

6 DISTINGUISHABILITY OF UNSEEN LABELS

Figure 3: Example of indistinguishable unseen labels Husky and Bulldog.

One unique challenge of WIS is that there may exist pairs of unseen labels which cannot be distinguished by the learned model.
For example, as shown in Fig. 3, Dog is a seen label which the LFs can predict, and Husky and Bulldog are unseen labels for which we want to generate training labels; however, we cannot distinguish between Husky and Bulldog even when the LFs make correct predictions of the seen label Dog, because Husky and Bulldog share the same label relation to Dog. To tackle this issue, we theoretically connect the distinguishability of unseen labels to the label relation structures and provide a testable condition for distinguishability. Intuitively, identical label relation structures can lead to indistinguishable unseen labels, as shown in Fig. 3; however, it turns out to be challenging to prove that different label relation structures guarantee distinguishability with respect to the model. To make this precise, we formally define distinguishability as follows.

Definition 1 (Distinguishability). For any model $P_\Theta(Y, \tilde{Y}, \hat{Y})$ with parameters $\Theta$, a pair of unseen labels $y_i, y_j \in \mathcal{Y}$ is distinguishable w.r.t. the model if, for a.e. $\Theta > 0$ (element-wise), there does NOT exist a $\tilde{\Theta} > 0$ such that, for all $\tilde{Y}, \hat{Y}$, the following equations hold:
$$P_\Theta(Y = y_i \mid \tilde{Y}, \hat{Y}) = P_{\tilde{\Theta}}(Y = y_j \mid \tilde{Y}, \hat{Y}), \qquad P_\Theta(Y = y_j \mid \tilde{Y}, \hat{Y}) = P_{\tilde{\Theta}}(Y = y_i \mid \tilde{Y}, \hat{Y}), \quad (4)$$
$$P_\Theta(Y = y \mid \tilde{Y}, \hat{Y}) = P_{\tilde{\Theta}}(Y = y \mid \tilde{Y}, \hat{Y}), \quad \forall y \in \mathcal{Y} \setminus \{y_i, y_j\}, \quad (5)$$
$$P_\Theta(\hat{Y}) = P_{\tilde{\Theta}}(\hat{Y}). \quad (6)$$

From the definition, we can see that the opposite of distinguishability, i.e., indistinguishability, describes an undesirable model: for any learned parameter $\Theta > 0$, we can always find another $\tilde{\Theta}$ which optimizes the loss equally well (Eq. (6)), but Eqs. (4)-(5) imply that whenever $P_\Theta$ predicts $y_i$, $P_{\tilde{\Theta}}$ will predict $y_j$ instead, which reflects that the model cannot distinguish the two unseen labels. Note that the notion of distinguishability is different from identifiability in PGMs: generic identifiability (Allman et al., 2015), the strongest notion of identifiability, requires the model to be identifiable up to label swapping, while distinguishability aims to rule out exactly this label swapping.

However, distinguishability is hard to verify directly, since Eqs. (4)-(5) and (6) need to hold for every possible configuration of $\tilde{Y}, \hat{Y}$ and every pair of unseen labels. Fortunately, for the proposed PLRM, we prove that distinguishability is equivalent to the asymmetry of the label relation structures when two conditions hold. To state the required conditions, we first introduce the notions of consistency and informativeness to characterize the label graph and the ILFs.

Consistency. We discuss the consistency of a label graph to avoid ambiguous or unrealistic label graphs. We interpret semantic labels $y_a, y_b$ as sets $A, B$, and then connect the label relations to the set relations (Fig. 2). Given the set interpretations, we define the consistency of a label graph as:

Definition 2 (Consistent Label Graph). A label graph $G = (\mathcal{Y}, E)$ is consistent if the induced set relations are consistent.

For example, assume $\mathcal{Y} = \{y_a, y_b, y_c\}$ and $t_{ab} = t_{bc} = t_{ca} = t_{sg}$. From $t_{ab}$ and $t_{bc}$, we observe that $A \supset B \supset C$, which contradicts $C \supset A$ implied by $t_{ca} = t_{sg}$. Thus, $G$ is inconsistent.

Informativeness. In addition, we describe what kind of ILF is desirable. Intuitively, an ILF is uninformative if it always "votes" for one of the desired labels. For example, if the desired label space $\mathcal{Y}$ is {Dog, Bird}, then for an ILF $\lambda_1$ outputting {Husky, Bulldog}, we know Dog is non-exclusive to both Husky and Bulldog, while Bird is exclusive to both.
In such a case, $\lambda_1$ can hardly provide information to help distinguish Dog from Bird, because it always votes for Dog. On the other hand, a binary classifier for Husky, i.e., $\lambda_2$, is favorable since it can output "Not a Husky" and thereby avoid consistently voting for Dog. We can see that an undesirable ILF always votes for a single desired label. To formally describe this, we define an informative ILF as:

Definition 3 (Informative ILF). An ILF $\lambda_j$ is informative if, for every $y \in \mathcal{Y}$, there exists $X_i \in D$ such that the output of $\lambda_j$ on $X_i$ is not in $N(y, \mathcal{Y}_{\lambda_j})$, i.e., $\hat{Y}^j_i \notin N(y, \mathcal{Y}_{\lambda_j})$.

Testable Conditions for Distinguishability. Based on the introduced notions, we prove the necessary and sufficient condition for a learned PLRM to be able to distinguish unseen labels:

Theorem 2. For a PLRM induced from a consistent label graph and informative ILFs, any pair $y_i, y_j \in \mathcal{Y}$ is indistinguishable if and only if $t_{ik} = t_{jk}$ for all $y_k \in \hat{\mathcal{Y}}$.

Theorem 2 provides users with a testable condition: for any pair of unseen labels $y_i, y_j$, there should exist at least one seen label $y_k$ such that $y_k$ has different label relations to $y_i$ and $y_j$, i.e., $t_{ik} \neq t_{jk}$, so that PLRM is able to distinguish $y_i$ and $y_j$. In preliminary experiments, we observe that violating this condition causes a dramatic drop in overall performance (about 10 points). Notably, based on Theorem 2, users can theoretically guarantee the distinguishability of a pair of unseen labels by adding only one seen label and corresponding ILFs to break the symmetry.

7 EXPERIMENTS

We demonstrate the applicability and performance of our method on image classification tasks derived from ILSVRC2012 (Russakovsky et al., 2015) and text classification tasks derived from LSHTC-3 (Partalas et al., 2015). Both datasets have off-the-shelf label relation structures (Deng et al., 2014; Partalas et al., 2015), which are directed acyclic graphs (DAGs) and from which we can query pairwise label relations. Indeed, there is a one-to-one mapping between a DAG structure of labels and a consistent label graph (see App. E.1 for an example). The ILSVRC2012 dataset consists of 1.2M training images from 1,000 leaf classes; for non-leaf classes, we follow Deng et al. (2014) and aggregate the images belonging to their descendant classes as their data points. The LSHTC-3 dataset consists of 456,886 documents and 36,504 labels organized in a DAG.

For each dataset, we randomly sample 100 different label graphs, each of which consists of 8 classes, and use each label graph to construct a WIS task. For each label graph, we treat 3 of the sampled classes as unseen classes and the other 5 as seen classes. The distinguishability condition in Sec. 6 is ensured for all the WIS tasks; the performance drop when it is violated can be found in App. G.1. We sample data belonging to the unseen classes for our experiments and split them into training and test sets. For the image classification tasks, we follow Mazzetto et al. (2021b;a) and train a set of image classifiers as supervision sources for the seen classes. For the text classification tasks, we construct keyword-based labeling functions as supervision sources for the seen classes following Zhang et al. (2021); each labeling function returns its associated label when a certain keyword exists in the text, and otherwise abstains. Notably, all the involved supervision sources are "weak" because they cannot predict the desired unseen classes. Experimental details and additional results are in App. F.
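The condition of Theorem 2 is easy to check programmatically when constructing a WIS task. The sketch below is one way to do so, assuming a `relations(y, y_hat)` lookup over the label graph like the one sketched in Section 4; the example inputs (including the extra seen label "cat") are illustrative, not our actual task generator.

```python
from itertools import combinations

def indistinguishable_pairs(unseen_labels, seen_labels, relations):
    """Return the pairs of unseen labels that violate the Theorem 2 condition."""
    bad = []
    for yi, yj in combinations(unseen_labels, 2):
        # Theorem 2: yi, yj are indistinguishable iff t(yi, yk) == t(yj, yk)
        # for every seen label yk; a single differing yk breaks the symmetry.
        if all(relations(yi, yk) == relations(yj, yk) for yk in seen_labels):
            bad.append((yi, yj))
    return bad

# Example inspired by Fig. 3: "husky" and "bulldog" relate identically to every
# seen label ("dog": subsumed, "cat": exclusive), so the pair is flagged.
REL = {("husky", "dog"): "sd", ("bulldog", "dog"): "sd",
       ("husky", "cat"): "e",  ("bulldog", "cat"): "e"}
print(indistinguishable_pairs(["husky", "bulldog"], ["dog", "cat"],
                              lambda a, b: REL[(a, b)]))
# -> [('husky', 'bulldog')]
```

A flagged pair can be repaired by adding one seen label (and its ILFs) whose relation to the two unseen labels differs, as noted above.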
7.2 COMPARED METHODS AND RESULTS

In addition to the WS-LG baseline, which is an adaptation of Data Programming (Ratner et al., 2019) to the WIS task, and PLRM, we include the following baselines. Note that all compared methods take the same data, ILFs, and label relations as input throughout our experiments for fair comparison.

Label Relation Majority Voting (LR-MV). We modify the majority voting method based on a label's non-exclusive neighbors: we replace $\hat{y}$ predicted by any ILF with the set of desired labels $N(\hat{y}, \mathcal{Y})$, i.e., the desired labels with a non-exclusive relation to $\hat{y}$, and then aggregate the modified votes.

Weighted Label Relation Majority Voting (W-LR-MV). LR-MV only leverages exclusive/non-exclusive label relations. To leverage fine-grained label relations, W-LR-MV attaches a weight to each replaced label. Specifically, if the ILF's output $\hat{y}$ is replaced with its ancestor label $y$ (subsumed relation), then the weight of $y$ equals 1, while for the other relations the weight is $1/|\mathcal{Y}'(\hat{y})|$, where $\mathcal{Y}'(\hat{y})$ is the subset of $N(\hat{y}, \mathcal{Y})$ whose relation to $\hat{y}$ is not the subsumed one.

For the above methods, we compare the performance of (1) directly applying the included models on the test set and (2) the end models (classifiers) trained with the inferred training labels.

Zero-Shot Learning (ZSL). It is non-trivial to apply ZSL methods, because ZSL assumes label attributes for all classes and a labeled training set of seen classes, while WIS takes as input an unlabeled dataset of unseen classes, label relations, and ILFs. Fortunately, the Direct Attribute Prediction (DAP) (Lampert et al., 2013) method is able to make predictions solely based on attributes without labeled data, by training an attribute classifier $p(a_i \mid x)$ for each attribute $a_i$. We therefore include it in our experiments. The details of applying DAP can be found in App. F.2.

Evaluation Results. For a fair comparison, we fix the network architecture of the classifiers for all methods. For image classification we use ResNet-32 (He et al., 2016), and for text classification we use logistic regression with pre-trained text embeddings (Reimers & Gurevych, 2019). The overall results for both datasets can be found in Table 2. From the results, we can see that PLRM consistently outperforms the baselines. The advantages of PLRM show the effect of not just leveraging the label graph, as the baselines do, but also modeling the accuracy of the ILFs and the strengths of the label relations, as PLRM does.

Table 2: Averaged evaluation results over 100 WIS tasks derived from LSHTC-3 and ILSVRC2012 (mean ± standard deviation).

| | Method | LSHTC-3 Accuracy | LSHTC-3 F1-score | ILSVRC2012 Accuracy | ILSVRC2012 F1-score |
| --- | --- | --- | --- | --- | --- |
| | DAP | 42.90 ± 13.53 | 35.98 ± 15.73 | 33.25 ± 3.68 | 29.13 ± 4.63 |
| Label Model | LR-MV | 58.86 ± 10.50 | 54.33 ± 11.10 | 46.88 ± 10.66 | 40.11 ± 16.44 |
| | W-LR-MV | 59.28 ± 10.47 | 54.55 ± 11.36 | 41.39 ± 10.80 | 30.19 ± 16.94 |
| | WS-LG | 62.60 ± 10.12 | 57.50 ± 11.19 | 53.68 ± 7.62 | 52.15 ± 7.94 |
| | PLRM | 64.65 ± 11.30 | 60.01 ± 13.39 | 56.18 ± 7.35 | 54.94 ± 7.44 |
| End Model | LR-MV | 67.17 ± 12.25 | 62.49 ± 13.95 | 49.60 ± 12.80 | 42.83 ± 18.17 |
| | W-LR-MV | 66.57 ± 11.73 | 61.80 ± 13.24 | 42.61 ± 12.46 | 31.34 ± 18.20 |
| | WS-LG | 70.69 ± 13.05 | 67.36 ± 14.24 | 56.56 ± 9.68 | 54.57 ± 11.17 |
| | PLRM | 72.32 ± 13.18 | 69.37 ± 14.41 | 58.38 ± 8.27 | 56.83 ± 8.49 |

The reported results have high variance, which indicates that the 100 different WIS tasks are diverse and of varying difficulty. We can also see that the end models are substantially better than directly applying the label models on the test set; this shows that the end models are able to generalize beyond the training labels produced by the label models.
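For reference, here is a minimal sketch of the LR-MV baseline described above: each ILF vote is expanded into its non-exclusive desired neighbors $N(\hat{y}, \mathcal{Y})$ and the expanded votes are counted. The inputs are illustrative, and W-LR-MV differs only in attaching the relation-dependent weights described earlier instead of a unit count per neighbor.

```python
def lr_mv(ilf_votes, unseen_labels, non_exclusive):
    """LR-MV for one data point.

    ilf_votes: seen labels emitted by the ILFs.
    non_exclusive: maps a seen label y_hat to N(y_hat, Y), its non-exclusive
    desired neighbors, queried from the label graph."""
    counts = {y: 0.0 for y in unseen_labels}
    for y_hat in ilf_votes:
        for y in non_exclusive[y_hat]:    # replace the vote by its neighbors
            counts[y] += 1.0
    total = sum(counts.values())
    if total == 0:                        # no vote mapped to any desired label
        return {y: 1.0 / len(unseen_labels) for y in unseen_labels}
    return {y: c / total for y, c in counts.items()}

N = {"caninae": ["dog", "wolf"],
     "domestic animals": ["dog", "cat"],
     "husky": ["dog"]}
print(lr_mv(["caninae", "domestic animals", "husky"],
            ["dog", "wolf", "cat", "lion"], N))
# {'dog': 0.6, 'wolf': 0.2, 'cat': 0.2, 'lion': 0.0}
```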
7.3 REAL-WORLD APPLICATION

In this section, we showcase, on a commercial advertising system (CAS), how to reduce the human annotation effort of new labeling tasks by formulating them as WIS problems. In a CAS, ads tagging (classification) is a critical application for understanding the semantics of ad copy. When new ads and tags are added to the system, manual annotations need to be collected to train a new classifier. As tags are commonly organized as taxonomies, the label relations between existing and new tags are readily available or can be trivially figured out by humans; existing classifiers and the heuristic rules previously used for annotating existing tags can serve as ILFs. Therefore, given (1) an unlabeled dataset of new tags, (2) the label relations, and (3) ILFs, we formulate the problem as a WIS problem. On this WIS formulation, we apply our method and the baselines to synthesize training labels for the new tags.

Specifically, we have two WIS tasks where the tags are under the Car Accessories and Furniture categories, respectively. For both tasks, we have 3 new tags and leverage 5 existing tags related to the new ones with given relations. On a test set, we evaluate the performance of DAP and the quality of the labels produced by the label models, as shown in Table 3. Note that since we re-use the existing labeling sources tailored for existing tags as ILFs and obtain label relations from an existing taxonomy, we achieve these results without any manual annotation or creation of new labeling functions. This demonstrates the potential of the proposed WIS task in real-world scenarios.

Table 3: Evaluation on product tagging with new tags.

| Category | Metric | DAP | LR-MV | W-LR-MV | WS-LG | PLRM |
| --- | --- | --- | --- | --- | --- | --- |
| Car Accessories | F1 | 50.62 | 68.68 | 68.06 | 66.85 | 76.37 |
| Car Accessories | Accuracy | 52.83 | 68.17 | 67.67 | 66.33 | 75.83 |
| Furniture | F1 | 30.81 | 64.70 | 61.45 | 70.59 | 80.57 |
| Furniture | Accuracy | 33.60 | 72.53 | 72.13 | 74.51 | 82.02 |

8 CONCLUSION

We propose Weak Indirect Supervision (WIS), a new research problem which leverages indirect supervision sources and label relations to synthesize training labels for training machine learning models. We develop the first method for WIS, the Probabilistic Label Relation Model (PLRM), along with generalization error bounds for both PLRM and the end model. We provide a theoretically-principled sanity test to ensure the distinguishability of unseen labels. Finally, we provide experiments demonstrating the effectiveness of PLRM and its advantages over baselines on both academic datasets and an industrial scenario.

Reproducibility Statement. All the assumptions and proofs of our theory can be found in App. C & D. Examples and illustrations of the label graph are in App. E. Experimental details can be found in App. F. Additional experiments are in App. G.

REFERENCES

Elizabeth S Allman, John A Rhodes, Elena Stanghellini, and Marco Valtorta. Parameter identifiability of discrete Bayesian networks with hidden variables. Journal of Causal Inference, 3(2):189-205, 2015.

Stephen H. Bach, Bryan He, Alexander J. Ratner, and Christopher Ré. Learning the structure of generative models without labeled data. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 2017.

Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alex Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Chris Ré, and Rob Malkin. Snorkel DryBell: A case study in deploying weak supervision at industrial scale.
In Proceedings of the 2019 International Conference on Management of Data, SIGMOD 19, pp. 362 375, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450356435. doi: 10.1145/3299869.3314036. URL https://doi.org/10.1145/3299869.3314036. Peter F Brown, Stephen A Della Pietra, Vincent J Della Pietra, and Robert L Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics, 19(2):263 311, 1993. Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser, and Dan Roth. Structured output learning with indirect supervision. In ICML, pp. 199 206, 2010. Lingjiao Chen, Matei Zaharia, and James Zou. Frugalml: How to use ml prediction apis more accurately and cheaply. In Advances in Neural Information Processing Systems (Neur IPS), 2020. Carlos d Andrea and André Mintz. Studying the live cross-platform circulation of images with computer vision api: An experiment based on a sports media event. International Journal of Communication, 13(0), 2019. ISSN 1932-8036. Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. In European conference on computer vision, pp. 48 64. Springer, 2014. Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, Jun Ma, Yifan Ethan Xu, Chenwei Zhang, Tong Zhao, Gabriel Blanco Saldana, et al. Autoknow: Self-driving knowledge collection for products of thousands of types. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2724 2734, 2020. Jared A. Dunnmon, Alexander J. Ratner, Khaled Saab, Nishith Khandwala, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew P. Lungren, Daniel L. Rubin, and Christopher Ré. Cross-modal data programming enables rapid medical machine learning. Patterns, 1(2):100019, 2020. ISSN 2666-3899. doi: https://doi.org/10.1016/j.patter. 2020.100019. URL https://www.sciencedirect.com/science/article/pii/ S2666389920300192. Jason A Fries, Ethan Steinberg, Saelig Khattar, Scott L Fleming, Jose Posada, Alison Callahan, and Nigam H Shah. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nature Communications, 12(1), 2021. Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc' Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26, pp. 2121 2129. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper/2013/file/ 7cce53cf90577442771720a370c3c723-Paper.pdf. Published as a conference paper at ICLR 2022 Daniel Y. Fu, Mayee F. Chen, Frederic Sala, Sarah M. Hooper, Kayvon Fatahalian, and Christopher Ré. Fast and three-rious: Speeding up weak supervision with triplet methods. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020. Melody Y. Guan, Varun Gulshan, Andrew M. Dai, and Geoffrey E. Hinton. Who said what: Modeling individual labelers improves classification. In AAAI, 2018. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. doi: 10.1109/CVPR.2016.90. Sarah Hooper, Michael Wornow, Ying Hang Seah, Peter Kellman, Hui Xue, Frederic Sala, Curtis Langlotz, and Christopher Re. 
Cut out the annotator, keep the cutout: better segmentation with weak supervision. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=bjk X6Kzb5H. Ashish Khetan, Zachary C. Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. In International Conference on Learning Representations, 2018. URL https://openreview. net/forum?id=H1s UHgb0Z. Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951 958. IEEE, 2009. Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zeroshot visual object categorization. IEEE transactions on pattern analysis and machine intelligence, 36(3):453 465, 2013. Percy Liang, Michael I Jordan, and Dan Klein. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389 446, 2013. Pierre Lison, Jeremy Barnes, Aliaksandr Hubin, and Samia Touileb. Named entity recognition without labelled data: A weak supervision approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1518 1533, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.139. URL https://www. aclweb.org/anthology/2020.acl-main.139. Alessio Mazzetto, Cyrus Cousins, Dylan Sam, Stephen H Bach, and Eli Upfal. Adversarial multi class learning under weak supervision with performance guarantees. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7534 7543. PMLR, 18 24 Jul 2021a. URL https://proceedings.mlr.press/v139/mazzetto21a.html. Alessio Mazzetto, Dylan Sam, Andrew Park, Eli Upfal, and Stephen Bach. Semi-supervised aggregation of dependent weak supervision sources with performance guarantees. In Arindam Banerjee and Kenji Fukumizu (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 3196 3204. PMLR, 13 15 Apr 2021b. URL https://proceedings.mlr.press/v130/ mazzetto21a.html. George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11): 39 41, 1995. Shikhar Murty, Pat Verga, L. Vilnis, and A. Mc Callum. Finer grained entity typing with typenet. AKBC Workshop, 2017. Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artières, George Paliouras, Éric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Gallinari. LSHTC: A benchmark for large-scale text classification. Co RR, abs/1503.08581, 2015. Meng Qu, Tianyu Gao, Louis-Pascal Xhonneux, and Jian Tang. Few-shot relation extraction via bayesian meta-learning on relation graphs. In International Conference on Machine Learning, pp. 7867 7876. PMLR, 2020. Published as a conference paper at ICLR 2022 Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional random fields for object recognition. Advances in neural information processing systems, 17:1097 1104, 2004. Aditi Raghunathan, Roy Frostig, John Duchi, and Percy Liang. Estimation from indirect supervision with linear moments. In International Conference on Machine Learning (ICML), 2016. A. J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, and C. Ré. Data programming: Creating large training sets, quickly. 
In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 2016. A. J. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré. Training complex models with multi-task weak supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, 2019. Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the 44th International Conference on Very Large Data Bases (VLDB), Rio de Janeiro, Brazil, 2018. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL https://arxiv.org/abs/1908. 10084. Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International conference on machine learning, pp. 2152 2161. PMLR, 2015. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211 252, December 2015. ISSN 0920-5691. doi: 10.1007/s11263-015-0816-y. URL https://doi.org/10.1007/ s11263-015-0816-y. Esteban Safranchik, Shiying Luo, and Stephen Bach. Weakly supervised sequence tagging from noisy rules. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5570 5578, 2020. Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Paul Hsu, and Kuansan Wang. An overview of microsoft academic service (mas) and applications. In WWW, 2015. Shashank Srivastava, Igor Labutov, and Tom Mitchell. Zero-shot learning of classifiers from natural language quantification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 306 316, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1029. URL https: //www.aclweb.org/anthology/P18-1029. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Research, 47(D1):D330 D338, 11 2018. ISSN 0305-1048. doi: 10.1093/nar/ gky1055. URL https://doi.org/10.1093/nar/gky1055. Paroma Varma, Frederic Sala, Shiori Sagawa, Jason Alan Fries, Daniel Y. Fu, Saelig Khattar, Ashwini Ramamoorthy, Ke Xiao, Kayvon Fatahalian, James Priest, and Christopher Ré. Multiresolution weak supervision for sequential data. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 192 203, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 93db85ed909c13838ff95ccfa94cebd9-Abstract.html. Kaifu Wang, Qiang Ning, and Dan Roth. Learnability with indirect supervision signals. Advances in Neural Information Processing Systems 32, 2020. Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. A survey of zero-shot learning: Settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10 (2):1 37, 2019. 
Published as a conference paper at ICLR 2022 Yuanshun Yao, Zhujun Xiao, Bolun Wang, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. Complexity vs. performance: Empirical analysis of machine learning as a service. In Proceedings of the 2017 Internet Measurement Conference, IMC 17, pp. 384 397, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450351188. doi: 10.1145/3131365.3131372. URL https://doi.org/10.1145/3131365.3131372. Kaichao You, Zhi Kou, Mingsheng Long, and Jianmin Wang. Co-tuning for transfer learning. Advances in Neural Information Processing Systems, 33, 2020. Eric Zhan, Stephan Zheng, Yisong Yue, Long Sha, and Patrick Lucey. Generating multi-agent trajectories using programmatic weak supervision. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkxw-h Ac FQ. Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner. Wrench: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. Yivan Zhang, Nontawat Charoenphakdee, and Masashi Sugiyama. Learning from indirect observations, 2019. Wenxuan Zhou, Hongtao Lin, Bill Yuchen Lin, Ziqi Wang, Junyi Du, Leonardo Neves, and Xiang Ren. Nero: A neural rule grounding framework for label-efficient relation extraction. The Web Conference, 2020. Published as a conference paper at ICLR 2022 SUPPLEMENTARY MATERIALS FOR CREATING TRAINING SETS VIA WEAK INDIRECT SUPERVISION The supplementary materials are organized as follows. In Appendix A, we provide a glossary of variables and symbols used in this paper. In Appendix B, we provide the details of PLRM model. In Appendix C and D, we provide the detailed proofs of Theorem 2 and Theorem 1 respectively. In Appendix E, we provide the detailed examples and illustrations of label graph in WIS. In Appendix F and G, we provide experimental details and additional experiment resulst respectively. A GLOSSARY OF SYMBOLS Table 4: Glossary of variables and symbols used in this paper. Symbol Simplified Used for Xi The i-th data point, Xi X m Number of data points Yi The true desired label of the i-th data point, Yi Y y A semantic label, e.g., "dog" Y The set of desired labels, Y = {y1, y2, . . . , yk} k Cardinality of Y, i.e., k = |Y| λj The j-th Indirect labeling function (ILF) n Number of ILF ˆY j i The output label of j-th ILF on i-th data point, ˆY j i Yλj ˆYi The concatenation of ILFs output, ˆYi = [ ˆY 1 i , ˆY 2 i , . . . , ˆY n i ] ˆyj A semantic label in the label space of λj Yλj Yj Label label space of ILF λj, Yλj = {ˆyj 1, ˆyj 2, . . . , ˆyj kλj } kλj kj Cardinality of the output space of ILF λ, i.e., kλj = |Yλj| ˆY Union set of all the Yλj, ˆY = {ˆy1, ˆy2, . . . , ˆyˆk} ˆk Cardinality of the ˆY, i.e., ˆk = | ˆY| K Total number of labels, i.e., K = ˆk + k Y i Latent binary variable indicating whether the data should be assigned ˆyi ˆY. Y Concatenation of all latent binary variable, Y = [ Y 1, . . . 
, Y ˆk] G Label graph, G = ( ˆY Y, E) E The set of label relations, E = {(yi, yj, tyiyj)|tyiyj T , i < j, yi, yj V} T The set of label relation types, T = {te, to, tsd, tsg} te Exclusive label relation to Overlap label relation tsg Subsuming label relation tsd Subsumed label relation N(y, ˆY) the set of non-exclusive neighbors of a given label y in ˆY ϕ A single dependency, or, factor function Φ Concatenation of all individual dependency M Number of total dependencies θ A single parameter of the PGM Θ Concatenation of all parameters of the PGM, Θ RM ˆΘ The learned parameters Θ The golden parameters W The parameter of an end model ˆW The learned parameters W The golden parameters Published as a conference paper at ICLR 2022 B DETAILS OF THE PLRM We use Y, Y , and ˆY to represent random vector. Then, we give the formal form of the PLRM as: PΘ(Y, Y , ˆY ) exp Θ Φ(Y, Y , ˆY ) . (7) Recall that Y is the unobserved true label, Y is the binary random vector, each of whose binary value Y i reflects whether the data should be assigned seen label ˆyi ˆY, and ˆY is the concatenated outputs of ILFs. Specifically, we enumerate Φ as below: 1. (Pseudo accuracy dependency): j [n], y Y/{unknown}, ˆy Yλj, we have ϕAcc y,ˆy,j(Y, ˆY j) := 1{Y = y ˆY j = ˆy ˆy N(y, Yλj)}1 2. (Accuracy dependency): j [n], ˆyi ˆY Yj we have ϕAcc ˆyi,j( Y i, ˆY j) := 1{ Y i = 1 ˆY j = ˆyi} 3. (Label relation dependency between seen labels): ˆyi, ˆyj ˆY, i < j (a) if tˆyi ˆyj = te, we have ϕe ˆyi,ˆyj( Y i, Y j) := 1{ Y i = 1 Y j = 1} (b) if tˆyi ˆyj = to, we have ϕo ˆyi,ˆyj( Y i, Y j) := 1{ Y i = 1 Y j = 1} (c) if tˆyi ˆyj = tsg, we have ϕsg ˆyi,ˆyj( Y i, Y j) := 1{ Y i = 0 Y j = 1} (d) if tˆyi ˆyj = tsd, we have ϕsd ˆyi,ˆyj( Y i, Y j) := 1{ Y i = 1 Y j = 0} 4. (Label relation dependency between desired and seen labels): y Y/{unknown}, ˆyi ˆY (a) if tyˆyi = te, we have ϕe y,ˆyi(Y, Y i) := 1{Y = y Y i = 1} (b) if tyˆyi = to, we have ϕo y,ˆyi(Y, Y i) := 1{Y = y Y i = 1} (c) if tyˆyi = tsg, we have ϕsg y,ˆyi(Y, Y i) := 1{Y = y Y i = 1} (d) if tyˆyi = tsd, we have ϕsd y,ˆyi(Y, Y i) := 1{Y = y Y i = 0} And example of our PLRM is shown in Fig. 4, where square with difference colors corresond to different dependency/factor functions in PLRM. Pn M8ft Ce M0g=Y Label Relation Dependency True Accuracy Dependency Pseudo Accuracy Dependency ˆY 1 ˆY 2 ˆY 3 Figure 4: PLRM. 1When ˆy / N(y, Yλj), ϕAcc y,ˆy,j is always zero and will not occur in the model. Here we use this form for the sack of rigorous representation. Published as a conference paper at ICLR 2022 C PROOF OF THEOREM 2 C.1 SIMPLIFYING THE NOTATION To simplify the indexing of dependencies, we use Φ1 to represent the concatenation of ϕ which involves both Y and Y , Φ2 to represent the concatenation of ϕ which involve both Y and ˆΛ, and Φ3 to represent the concatenation of remaining ϕ which do not involve Y. Specifically, Φ1 consists of k ˆk components corresponding to the dependency between the k desired labels and the ˆk seen labels. We use the subscript i, j to denote the dependency function between the desired label yi and seen label ˆyj, i.e., Φ1 i,j = ϕ? yi,ˆyj, where ? is the corresponding relation. Similarly, Φ2 consists of k (Pn j=1 kj) components corresponding to the dependency between the k desired labels and the kj seen labels output by the ILF λj (j [n]), and we use Φ2 i,j,l to denote the dependency of yi and ˆyj l , and Φ2 i,j = (Φ2 i,j,l)kj l=1to denote the dependency of yi and ˆyj. 
According to Φ1, Φ2, and Φ3, we also divide the parameter Θ into Θ1 (with elements being Θ1 i,j correspondingly), Θ2 (with elements being Θ2 i,j = (Θ2 i,j,l)kj l=1 correspondingly), and Θ3, and the joint probability is then given as: PΘ(Y, Y , ˆY ) = exp (Θ1)TΦ1(Y, Y ) + (Θ2) Φ2(Y, ˆY ) + (Θ3)TΦ3( Y , ˆY ) Y , Y , ˆY exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) + (Θ3)TΦ3( Y , ˆY ) (8) Also, for notation convenience, we adopt following simplifications: 1. yi Y i [k] since |Y| = k, similarly, ˆyi Yj i [kj] and ˆyi ˆY i [ˆk]; 2. λj j [n] since we have n ILFs in total; 3. ϕ tyiyj yi,yj ϕt yi,yj where t = tyiyj and can be seen from the subscript of the dependency. C.2 PROPOSITIONS AND LEMMAS First, we state some propositions and lemmas that will be useful in the proof to come. Proposition 1 (Multi-class classification). For a multi-class classification task, yi, yj Y, we have tyiyj = te. Similarly, ˆya, ˆyb ˆY, we have tˆya ˆyb = te. Lemma 1. For a consistent label graph G and ˆyl ˆY, yi, yj Y, if tyi ˆyl = to, we have tyj ˆyl = tsg . Proof. For yi, yj Y, based on Proposition 1, we know tyiyj = te, which implies (1) the intersection of the sets labeled by yi and yj is empty. For ˆyl ˆY, if tyi ˆyl = to, we have (2) the intersection of the sets labeled by yi and ˆyl is not empty. If tyj ˆyl = tsg, which implies (3) yj ˆyl. Based on (2)(3), we have yi yj = , which is contradictory to (1). Thus, we prove when tyi ˆyl = to, tyj ˆyl = tsg. Lemma 2. For an informative ILF λj and given any yd Y, there exists some ˆyl Yj, such that, Φ2 d,j,l(yd, ˆy) = 0, l [kj]. Proof. Because ILF λj is informative, we know there exists one ˆya Yj such that ˆya is exclusive to yd, i.e., ˆya / N(yd, Yj). Therefore, for any ˆyl ˆY, either ˆya = ˆyl, or ˆyl = ˆya / Yj, which leads to the conclusion by the definition of Φ2 d,j,l = ϕAcc yd,ˆyl,j. Published as a conference paper at ICLR 2022 C.3 DEFINITIONS Before the main proof, we connect the indistinguishablity of label relation structure with the dependency structure of PLRM by introducing the concept of symmetry as follows: Definition 4 (Symmetry). For yi, yj Y, we say yi and yj have symmetric dependency structure if the following equation holds: Φ1 i,l = Φ1 j,l, l ˆk; Φ2 i,a,b = Φ2 j,a,b, a [n], b [ka]. (9) Based on the construction of PLRM, we know that for yi, yj Y, ˆyb ˆY, tyi ˆyb = tyj ˆyb (the statement in Theorem 2) is equivalent to yi and yj have symmetric dependency structure. C.4 EQUIVALENT STATEMENT OF THEOREM 2 Our main result states that asymmetric is equivalent to distinguishable as in the following theorem, which can readily be seen to be identical to Theorem 2 in the main body of the paper: Theorem 3. For a probability model defined as Eq. (8) induced from a consistent label graph and informative ILFs, for any pair of yi, yj Y, yi and yj are distinguishable if and only if they have asymmetric dependency structure. C.5 PROOF OF THE NECESSITY IN THEOREM 3: NECESSARY CONDITION We first prove that for any yi, yj Y, yi and yj have asymmetric dependency structure is the necessary condition of that they are distinguishable. Proof of Theorem 3. We prove this theorem by reduction to absurdity. Suppose yi and yj are symmetric. Then, by Eq. (8), the distribution of Y condition on any Y and ˆY can be calculated as follows: PΘ(Y = yi| Y , ˆY ) = PΘ(yi, Y , ˆY ) PΘ( Y , ˆY ) . On the other hand, applying Y = yi in the definition of Φ2 leads to Φ2 r,a,l(yi, ) = 0, r [k], r = i, a [n], l [ka]. 
We further separate Φ1 into (Φ1 i )k i=1, where Φ1 i collects all the dependency in Φ1 with yi involved, i.e., Φ1 i = (Φ1 i,j) ˆk j=1, with the corresponding parameters respectively denoted as Θ1 i with Θ1 = (Θ1 i )k i=1. Similarly, Φ2 is also divided into (Φ2 i )k i=1 following the same routine and Θ2 is respectively divided into (Θ2 i )k i=1. Specifically, if yi and yj are symmetric, we further have Φ1 i = Φ1 j, Φ2 i = Φ2 j. Based on the notation, PΘ(Y = yi| Y , ˆY ) can then be represented as PΘ(Y = yi| Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) + (Θ3)TΦ3( Y , ˆY ) ! l=1 (Θ1 l )TΦ1 l (yi, Y ) + l=1 (Θ2 l )TΦ2 l (yi, ˆY ) + (Θ3)TΦ3( Y , ˆY ) Published as a conference paper at ICLR 2022 which further leads to PΘ(Y = yi| Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) ! l=1 (Θ1 l )TΦ1 l (yi, Y ) + (Θ2 i )TΦ2 i (yi, ˆY ) which is independent of Θ3. Similarly, PΘ(Y = yj| Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) ! l=1 (Θ1 l )TΦ1 l (yi, Y ) + (Θ2 j)TΦ2 j(yj, ˆY ) and l [k]/{i, j}, PΘ(Y = yl| Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) ! l=1 (Θ1 l )TΦ1 l (yi, Y ) + (Θ2 l )TΦ2 l (yl, ˆY ) Let Θ be defined as follows: Θ1 i = Θ1 j, Θ1 j = Θ1 i , Θ1 l = Θ1 l , l / {i, j}, Θ2 i = Θ2 j, Θ2 j = Θ2 i , Θ2 l = Θ2 l , l / {i, j}, and Θ3 = Θ3. We then have PΘ(Y = yi| Y , ˆY ) P Θ(Y = yj| Y , ˆY ) Y exp ( Θ1) Φ1(Y , Y ) + ( Θ2) Φ2(Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) exp ((Θ1 i )T(Φ1 i (yi, Y ) Φ1 j(yj, Y )) + (Θ1 j)T(Φ1 j(yi, Y ) Φ1 i (yj, Y )) + (Θ2 i )T(Φ2 i (yi, ˆY ) Φ2 j(yj, ˆY ))) Y exp ( Θ1) Φ1(Y , Y ) + ( Θ2) Φ2(Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) . PΘ(Y = yj| Y , ˆY ) P Θ(Y = yi| Y , ˆY ) Y exp ( Θ1) Φ1(Y , Y ) + ( Θ2) Φ2(Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) exp ((Θ1 j)T(Φ1 j(yj, Y ) Φ1 i (yi, Y )) + (Θ1 i )T(Φ1 i (yj, Y ) Φ1 j(yi, Y )) + (Θ2 j)T(Φ2 j(yj, ˆY ) Φ2 i (yi, ˆY ))) Y exp ( Θ1) Φ1(Y , Y ) + ( Θ2) Φ2(Y , ˆY ) P Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) . Published as a conference paper at ICLR 2022 and l [k]/{i, j}, PΘ(Y = yl| Y , ˆY ) P Θ(Y = yl| Y , ˆY ) = Y exp ( Θ1) Φ1(Y , Y ) + ( Θ2) Φ2(Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) . Similarly, we have PΘ(Y = unknown| Y , ˆY ) P Θ(Y = unknown| Y , ˆY ) = Y exp ( Θ1) Φ1(Y , Y ) + ( Θ2) Φ2(Y , ˆY ) Y exp (Θ1) Φ1(Y , Y ) + (Θ2) Φ2(Y , ˆY ) . Therefore, we have PΘ(Y = yi| Y , ˆY ) P Θ(Y = yj| Y , ˆY ) = PΘ(Y = yj| Y , ˆY ) P Θ(Y = yi| Y , ˆY ) = PΘ(Y = y| Y , ˆY ) P Θ(Y = y| Y , ˆY ) , y Y/{yi, yj}. Since PΘ(Y = yi| Y , ˆY ) + PΘ(Y = yj| Y , ˆY ) + X l =i,j PΘ(Y = yl| Y , ˆY ) = 1, and P Θ(Y = yj| Y , ˆY ) + P Θ(Y = yi| Y , ˆY ) + X l =i,j P Θ(Y = yl| Y , ˆY ) = 1, we obtain that PΘ(Y = yi| Y , ˆY ) = P Θ(Y = yj| Y , ˆY ) PΘ(Y = yj| Y , ˆY ) = P Θ(Y = yi| Y , ˆY ) PΘ(Y = yl| Y , ˆY ) = P Θ(Y = yl| Y , ˆY ), which indicates yi and yj indistinguishable, and leads to a contradictory. The proof is completed. C.6 PROOF OF THEOREM 3: SUFFICIENT CONDITION We then prove that for any yi, yj Y, yi and yj have asymmetric dependency structure is the sufficient condition of that they are distinguishable. Proof. We use the same notations (Θ1 i )k i=1, (Θ2 i )k i=1, and Θ3 in Appendix C.5 to denote the separation of the parameter Θ. Let Θ be any parameter satisfying that there exists a parameter Θ, such that Eq. (4-5) holds. By Eqs. (10), (11), and Eq. (12) together with Eqs. 
(4-5), we have r [k], r = i, j, exp ((Θ1 i )TΦ1 i (yi, Y ) + (Θ1 j)TΦ1 j(yi, Y ) + (Θ2 i )TΦ2 i (yi, ˆY )) exp (( Θ1 i )TΦ1 i (yj, Y ) + ( Θ1 j)TΦ1 j(yj, Y ) + ( Θ2 j)TΦ2 j(yj, ˆY )) =exp ((Θ1 i )TΦ1 i (yj, Y ) + (Θ1 j)TΦ1 j(yj, Y ) + (Θ2 j)TΦ2 j(yj, ˆY )) exp (( Θ1 i )TΦ1 i (yi, Y ) + ( Θ1 j)TΦ1 j(yi, Y ) + ( Θ2 i )TΦ2 i (yi, ˆY )) =exp ((Θ1 i )TΦ1 i (yr, Y ) + (Θ1 j)TΦ1 j(yr, Y )) exp (( Θ1 i )TΦ1 i (yr, Y ) + ( Θ1 j)TΦ1 j(yr, Y )) = exp ((Θ1 i )TΦ1 i (yj, Y ) + (Θ1 j)TΦ1 j(yi, Y )) exp (( Θ1 i )TΦ1 i (yj, Y ) + ( Θ1 j)TΦ1 j(yi, Y )) . By simple rearranging, we have ((Θ1 i )TΦ1 i (yi, Y ) + (Θ1 j)TΦ1 j(yi, Y ) + (Θ2 i )TΦ2 i (yi, ˆY ) + (Θ2 j)TΦ2 j(yi, ˆY )) (( Θ1 i )TΦ1 i (yj, Y ) + ( Θ1 j)TΦ1 j(yj, Y ) + ( Θ2 i )TΦ2 i (yj, ˆY ) + ( Θ2 j)TΦ2 j(yj, ˆY )) =((Θ1 i )TΦ1 i (yj, Y ) + (Θ1 j)TΦ1 j(yj, Y ) + (Θ2 i )TΦ2 i (yj, ˆY ) + (Θ2 j)TΦ2 j(yj, ˆY )) (( Θ1 i )TΦ1 i (yi, Y ) + ( Θ1 j)TΦ1 j(yi, Y ) + ( Θ2 i )TΦ2 i (yi, ˆY ) + ( Θ2 j)TΦ2 j(yi, ˆY )) =((Θ1 i )TΦ1 i (yj, Y ) + (Θ1 j)TΦ1 j(yi, Y )) (( Θ1 i )TΦ1 i (yj, Y ) + ( Θ1 j)TΦ1 j(yi, Y )). (13) Published as a conference paper at ICLR 2022 By the equality between the second term and the third term in Eq. (13), we obtain that (Θ1 j)TΦ1 j(yi, Y ) ( Θ1 i )TΦ1 i (yj, Y ) =((Θ1 j)TΦ1 j(yj, Y ) + (Θ2 j)TΦ2 j(yj, ˆY )) (( Θ1 i )TΦ1 i (yi, Y ) + ( Θ2 i )TΦ2 i (yi, ˆY )). (14) We further set Y in Eq. (14) respectively to el (the one hot vector with its l-th position being 1) and 0 for any fixed l [ˆk], i.e., ((Θ1 j)TΦ1 j(yi, el) (Θ1 j)TΦ1 j(yi, 0)) (( Θ1 i )TΦ1 i (yj, el) ( Θ1 i )TΦ1 i (yj, 0)) =((Θ1 j)TΦ1 j(yj, el) (Θ1 j)TΦ1 j(yj, 0)) (( Θ1 i )TΦ1 i (yi, el) ( Θ1 i )TΦ1 i (yi, 0)), which by simple rearranging further leads to Θ1 j,l(Φ1 j,l(yj, 1) Φ1 j,l(yj, 0) Φ1 j,l(yi, 1)) = Θ1 i,l(Φ1 i,l(yi, 1) Φ1 i,l(yi, 0) Φ1 i,l(yj, 1)). Since Θ1 j,l, Θ1 i,l > 0, and by definition we have |Φ1 j,l(yj, 1) Φ1 j,l(yj, 0) Φ1 j,l(yi, 1)| = 1, and |Φ1 i,l(yi, 1) Φ1 i,l(yi, 0) Φ1 i,l(yj, 1)| = 1, we obtain Θ1 j,l = Θ1 i,l, and Φ1 j,l(yj, 1) Φ1 j,l(yj, 0) Φ1 j,l(yi, 1) = Φ1 i,l(yi, 1) Φ1 i,l(yi, 0) Φ1 i,l(yj, 1). (15) Therefore, either tyj ˆyl {to, tsd, tsg} and tyi ˆyl {to, tsd, tsg}, or tyj ˆyl = te and tyi ˆyl = te, which by definition further indicates that Φ2 i = Φ2 j (recall the way we build dependency between Y and ˆY ). As l is arbitrarily picked, we then have Θ1 j is equal to Θ1 i component-wisely. By the equality between the first term and the third term in Eq. (13) and following exact the same routine, we also have Θ1 j = Θ1 i . On the other hand, for any r [ˆk], fixing Y and ˆY s ( s = r), and setting ˆYr = ˆyr l (l kr, ˆyr l N(yj, Yl)) in Eq. (14), we have (Θ1 j)TΦ1 j(yj, Y ) + ( Θ1 i )TΦ1 i (yj, Y ) + Θ2 j,r,lΦ2 j,r,l(yj, ˆyr l ) + X s =r Θ2 j,sΦ2 j,s(yj, Y s) =(Θ1 j)TΦ1 j(yi, Y ) + ( Θ1 i )TΦ1 i (yi, Y ) + Θ2 i,r,lΦ2 i,r,l(yi, ˆyr l ) + X s =r Θ2 i,sΦ2 i,s(yi, ˆY s). On the other hand, by Lemma 2, there exists some p, s.t., ˆyr p / N(yj, Yr) (which by Φ2 i = Φ2 j further leads to ˆyr p / N(yi, Yr)). Setting ˆYr = ˆyr l leads to (Θ1 j)TΦ1 j(yj, Y ) + ( Θ1 i )TΦ1 i (yj, Y ) + X s =r Θ2 j,sΦ2 j,s(yj, Y s) =(Θ1 j)TΦ1 j(yi, Y ) + ( Θ1 i )TΦ1 i (yi, Y ) + X s =r Θ2 i,sΦ2 i,s(yi, ˆY s). Subtracting the above two equations leads to Θ2 j,a,l = Θ2 i,a,l. Since a and l are arbitrarily picked, we conclude that Θ2 j = Θ2 i . Following the same routine, we also have Θ2 i = Θ2 j. Therefore, by applying Θ1 j = Θ1 i , Θ1 i = Θ1 j, Θ2 j = Θ2 i , and Θ2 i = Θ2 j in Eq. 
(13), we have (Θ1 i )TΦ1 i (yi, Y ) (Θ1 i )TΦ1 j(yj, Y ) = (Θ1 i )TΦ1 i (yj, Y ) (Θ1 i )TΦ1 j(yi, Y ), (Θ1 j)TΦ1 j(yj, Y ) (Θ1 j)TΦ1 i (yi, Y ) = (Θ1 j)TΦ1 j(yi, Y ) (Θ1 j)TΦ1 i (yj, Y ). Published as a conference paper at ICLR 2022 Let Y = 1ˆk (i.e., the ˆk-dimension all 1 vector), we have (Θ1 i )T((Φ1 i (yi, 1ˆk) Φ1 i (yj, 1ˆk)) ((Φ1 j(yj, 1ˆk) Φ1 j(yi, 1ˆk)))) = 0, (16) (Θ1 j)T((Φ1 i (yi, 1ˆk) Φ1 i (yj, 1ˆk)) ((Φ1 j(yj, 1ˆk) Φ1 j(yi, 1ˆk)))) = 0. (17) Since yi and yj are asymmetric, we have that there exists l, such that tyi ˆyl = tyj ˆyl. Concretely, by Eq. (15), we have tyi ˆyl {to, tsd, tsg}, tyj ˆyl = {to, tsd, tsg}, and tyi ˆyl = tyj ˆyl. On the other hand, Φ1 i,l(yi, 1) Φ1 i,l(yj, 1)) = Φ1 j,l(yj, 1) Φ1 j,l(yi, 1), if and only if tyi ˆyl = to, tyj ˆyl = tsg, or tyj ˆyl = to, tyi ˆyl = tsg, which contradicts Lemma 1. Therefore, Φ1 i,l(yi, 1) Φ1 i,l(yj, 1)) = Φ1 j,l(yj, 1) Φ1 j,l(yi, 1). In this case, solutions of Θ1 i , Θ1 j subject to respectively Eqs. (16) and (17) lie along a zero-measure set. The proof is completed. D PROOF OF THEOREM 1 D.1 LEARNING ALGORITHM We first present the algorithm for producing ˆΘ and ˆW in Algorithm 1. Algorithm 1 WIS Require: Step size η, dataset D X, and initial parameter Θ0. ˆΘ Θ0. for all X D do Independently sample (Y, Y , ˆY ) from π ˆΘ, and (Y , Y , ˆY ) from π ˆΘ conditionally given ˆY = ˆY (X). ˆΘ ˆΘ + η(Φ(Y, Y , ˆY ) Φ(Y , Y , ˆY )). Compute ˆW as described in (3) using ˆΘ. output (ˆΘ, ˆW) D.2 ASSUMPTIONS First, the problem distribution π needs to be accurately modeled by some distribution Θ in the family that we are trying to learn: Θ s.t. (Y, ˆY ), p(X,Y ) π (Y, ˆY ) = pθ (Y, ˆY ). (18) Secondly, given an example (X, Y ) π , we assume Y is independent of X given ˆY (X): (X, Y ) π Y X | ˆY (X). (19) This assumption encodes the idea that while the ILFs can be arbitrarily dependent on the features, they provide sufficient information to accurately identify the true label vector. Then, for any Θ, accurately learning Θ from data distribution is possible. That is, there exists an unbiased estimator ˆΘ(D) which is a function of the dataset D of i.i.d from πΘ, such that, for any Θ and some c > 0, Cov(ˆΘ(D)) I 2c|D|. (20) And we are reasonably certain in our guess of latent variables, i.e., Y and Y . That is, for any Θ, Θ , i=1 (ni + ˆk)Var(Y, Y , ˆY ) πΘ(1Y =yi| ˆY = ˆY )2 + i=1 (mi + K 1)Var(Y, Y , ˆY ) πΘ( Y i| ˆY = ˆY )2 # 1 Published as a conference paper at ICLR 2022 We also assume that the output of the last layer of end model h W has bounded ℓ norm, that is, for any possible parameter W, h W H. (22) Finally, we assume that solving Eq. (3) has bounded generalization risk such that for some χ > 0, solution ˆW satisfies E ˆ W h ℓˆΘ( ˆ W) min W ℓˆΘ(W) i χ. (23) D.3 PROOF OF THEOREM 1 To begin with, we state two basic lemmas needed for proofs throughout this section: Lemma D.1. Let x1, x2 be two binary random variable. Then we have variance of product of x1 and x2 can be bounded as Var [x1x2] Var [x1] + Var [x2] . Lemma D.2. Let Y be a random vector and s be the spectral norm. Then we have Cov(Y, Y ) s X Then, we borrow two lemmas from (Ratner et al., 2016), which are slightly different from the original ones but can be easily proved following the same derivations: Lemma D.3. [Lemma D.1 in (Ratner et al., 2016)] Given a family of maximum-entropy distributions πΘ(Y, Y , ˆY ) = 1 ZΘ exp (ΘTΦ(Y, Y , ˆY )). 
If we let J be the maximum expected log-likelihood objective, under another distribution π , for the event associated with the observed labeling function values ˆY , J(Θ) = E(Y , Y , ˆY ) π h log P(Y, Y , ˆY ) πΘ ˆY = ˆY i , then its Hessian can be calculated as 2J(Θ) = E(Y , Y , ˆY ) π h Cov(Y, Y , ˆY ) πΘ ϕ(Y, Y , ˆY ) | ˆY = ˆY i Cov(Y, Y , ˆY ) πΘ(ϕ(Y, Y , ˆY )). Lemma D.4. [Lemma D.4 in (Ratner et al., 2016)] Suppose that we are looking at a WIS maximum likelihood estimation problem and the objective function J(Θ) is strongly concave with concavity parameter c > 0. If we run stochastic gradient descent using unbiased samples from a true distribution πΘ , then if we set step size as and run (using a fresh sample at each iteration) for T steps, where T = 2 c2ϵ2 log We can bound the expected parameter estimation error with E ˆΘ Θ 2 Mϵ2, (24) where M is the dimension of Θ. Based on Lemma D.4, in order to obtain the optimization error with respect to the estimated ˆΘ produced by Algorithm 1, we only need to show that the WIS object function J(Θ)2 is strongly concave. We prove this through the following lemma, which is a non-trivial extension of Lemma D.3 in (Ratner et al., 2016) given the fact that we have multiple latent variables and relatively complex dependency structures with comparison to (Ratner et al., 2016): 2Note that, in the Eq. (2) of the main body of the paper, we are minimizing J(Θ), which is equivalent to maximizing J(Θ) as discussed here. Published as a conference paper at ICLR 2022 Lemma D.5. [Extension of Lemma D.3 in (Ratner et al., 2016)] With conditions (20) and (21), the WIS objective function J(Θ) is strongly concave with strong convexity c. We then come to bound the generalization error of ˆW produced by Algorithm 1, using the following non-trivial extension of Lemma D.5 in (Ratner et al., 2016): Lemma D.6. [Extension of Lemma D.5 in (Ratner et al., 2016)] Suppose that conditions (18)-(23) hold. Let ˆW be the learned parameters of the end model produced by Algorithm 1, and ℓ(W ) be the minimum of cross entropy loss function ℓ. Then, we can bound the expected risk with E h ℓ( ˆW) ℓ(W ) i χ + 4c Hϵ. Finally, we conclude Lemmas (D.4), (D.5) and (D.6) as the following theorem, which is identical to the Theorem 1 in the main body of the paper: Theorem 4 (Extension of Theorem 2 in (Ratner et al., 2016)). Suppose that we run Algoirthm 1 on a WIS specification to produce ˆΘ and ˆW, and all conditions of Lemmas (D.5) and (D.6) are satisfied. Then, for any ϵ > 0, if we set the step size to be and the input dataset D is large enough such that |D| > 2 c2ϵ2 log then we can bound the expected parameter error and the expected risk as: E ˆΘ Θ 2 Mϵ2, E h ℓ( ˆW) ℓ(W ) i χ + 4c Hϵ. D.4 PROOFS OF LEMMAS Lemma D.1. Let x1, x2 be two binary random variable. Then we have variance of product of x1 and x2 can be bounded as Var [x1x2] Var [x1] + Var [x2] . Proof. Joint distribution of x1 and x2 can be listed as the following table: (where p1+p2+p3+p4 = 1) x1/x2 0 1 0 p1 p2 1 p3 p4 Then we have Var [x1x2] = p4 p2 4 = p4(p1 + p2 + p3), Var [X1] + Var [X2] = (p2 + p4)(p1 + p3) + (p3 + p4)(p1 + p2) p4(p1 + p2 + p3). The proof is completed. Lemma D.2. Let Y be a random vector and s be the spectral norm. Then we have Cov(Y, Y ) s X Published as a conference paper at ICLR 2022 Proof. By definition of spectral norm, we have Cov(Y, Y ) s = max x 2 1 x TCov(Y, Y )x Where x is a constant vector. And by Cauchy-Schwarz inequality, x TCov(Y, Y )x = E x T(Y E [Y ])(Y E [Y ])Tx E h x 2 Y E [Y ] 2i . 
Because x is a constant vector and x 1, max x 2 1 E h x 2 Y E [Y ] 2i = max x 2 1 x 2 E h Y E [Y ] 2i = max x 2 1 x 2 "X The proof is completed. Lemma D.5. [Extension of Lemma D.3 in (Ratner et al., 2016)] With conditions (20) and (21), the WIS objective function J(Θ) is strongly concave with strong convexity c. Proof. By Lemma D.3, hessian matrix of J can be decomposed as follows: 2J(Θ) = E ˆY πΘ h Cov(Y, Y , ˆY ) πΘ Φ(Y, Y , ˆY ) | ˆY = ˆY i Cov(Y, Y , ˆY ) πΘ(Φ(Y, Y , ˆY )). Basically, to prove that J(Θ) is strongly concave with strong convexity c, we need to show for a real number c > 0, We calculate each term separately: for the first term A = E ˆY πΘ h Cov(Y, Y , ˆY ) πΘ Φ(Y, Y , ˆY ) | ˆY = ˆY i , since A is symmetric, for any real number c, A c I, if and only if its spectral norm A s c, where A s equals to the eigenvalue of A with largest absolute value. Since by definition, vector function Φ(Y, Y , ˆY ) can be represented as: Φ(Y, Y , ˆY ) = ϕAcc yi,ˆyj l ,j(Y, ˆY j) i [k],j [n],ˆyj l N(yi,Yj) ϕAcc ˆyi,j( Y i, ˆY j) j [n],ˆyi Yj ϕt ˆyi,ˆyj( Y i, Y j) i,j [ˆk] ϕt yi,ˆyj(Y, Y j) i [k],j [ˆk] Published as a conference paper at ICLR 2022 by Lemma D.2, we have A can be further bounded by ˆyj l N(yi,Yj) Var(Y, Y , ˆY ) πΘ ϕAcc yi,ˆyj l ,j(Y, ˆY j) | ˆY = ˆY ˆyi Yj Var(Y, Y , ˆY ) πΘ ϕAcc ˆyi,j( Y i, ˆY j) | ˆY = ˆY 1 i,j ˆk Var(Y, Y , ˆY ) πΘ ϕt ˆyi,ˆyj( Y i, Y j) | ˆY = ˆY j=1 Var(Y, Y , ˆY ) πΘ ϕt yi,ˆyj(Y, Y j) | ˆY = ˆY =A1 + A2 + A3 + A4, A1 =E ˆY πΘ ˆyj l N(yi,Yj) Var(Y, Y , ˆY ) πΘ ϕAcc yi,ˆyj l ,j(Y, ˆY j) | ˆY = ˆY A2 =E ˆY πΘ ˆyl Yj Var(Y, Y , ˆY ) πΘ ϕAcc ˆyl,j(Y, ˆY j) | ˆY = ˆY A3 =E ˆY πΘ 1 i,j ˆk Var(Y, Y , ˆY ) πΘ ϕt ˆyi,ˆyj( Y i, Y j) | ˆY = ˆY A4 =E ˆY πΘ j=1 Var(Y, Y , ˆY ) πΘ ϕt yi,ˆyj(Y, Y j) | ˆY = ˆY We then bound the four terms respectively. As for A1, for fixed ˆY , we have ˆyj l N(yi,Yj) Var(Y, Y , ˆY ) πΘ ϕAcc yi,ˆyj l ,j(Y, ˆY j) | ˆY = ˆY ˆyj l N(yi,Yj) Var(Y, Y , ˆY ) πΘ 1Y =yi ˆY j=ˆyj l | ˆY = ˆY j [n],ˆyj l N(yi,Yj),( ˆY )j=ˆyj l Var(Y, Y , ˆY ) πΘ 1Y =yi | ˆY = ˆY j [n],ˆyj l N(yi,Yj),( ˆY )j=ˆyj l Var(Y, Y , ˆY ) πΘ 1Y =yi | ˆY = ˆY i=1 ni Var(Y, Y , ˆY ) πΘ 1Y =yi | ˆY = ˆY , where ni is the number of ILFs whose label space contains label that is non-exclusive to label yi, i.e., ni = |{j [n]|N(yi, Yj) = }|. Therefore, we have i=1 ni E ˆY πΘ Var(Y, Y , ˆY ) πΘ 1Y =yi | ˆY = ˆY . Published as a conference paper at ICLR 2022 Similarly, for A2, we have i=1 mi E ˆY πΘ Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY , where mi is the number of ILFs whose label space contains the label ˆyi. As for A3, for fixed ˆY and any ˆyi, ˆyj ˆY, we further separate the proof into subcases by tˆyi ˆyj which is simplified as t: (1). t = to. In this case, Var(Y, Y , ˆY ) πΘ ϕt ˆyi,ˆyj( Y i, Y j) | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ 1 Y i= Y j | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ Y i Y j | ˆY = ˆY ( ) Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY + Var(Y, Y , ˆY ) πΘ Y j | ˆY = ˆY , where Eq. ( ) is due to Lemma D.1. (2). t = te. Similarly, Var(Y, Y , ˆY ) πΘ ϕt ˆyi,ˆyj( Y i, Y j) | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ 1 Y i= Y j=1 | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ 1 Y i= Y j=1 | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ Y i Y j | ˆY = ˆY Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY + Var(Y, Y , ˆY ) πΘ Y j | ˆY = ˆY , (3). t = tsg. In this case, Var(Y, Y , ˆY ) πΘ ϕt ˆyi,ˆyj( Y i, Y j) | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ 1 Y i=1, Y j=0 | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ (1 Y i) Y j | ˆY = ˆY Var(Y, Y , ˆY ) πΘ 1 Y i | ˆY = ˆY + Var(Y, Y , ˆY ) πΘ Y j | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY + Var(Y, Y , ˆY ) πΘ Y j | ˆY = ˆY , (4). t = tsd. 
Similar to (3)., Var(Y, Y , ˆY ) πΘ ϕt ˆyi,ˆyj( Y i, Y j) | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ 1 Y i=0, Y j=1 | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ (1 Y j) Y i | ˆY = ˆY Var(Y, Y , ˆY ) πΘ 1 Y j | ˆY = ˆY + Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY =Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY + Var(Y, Y , ˆY ) πΘ Y j | ˆY = ˆY , Published as a conference paper at ICLR 2022 Combining (1), (2), (3), and (4), we have i=1 (ˆk 1)E ˆY πΘ Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY , As for A4, by similar discussion of A3, i=1 k E ˆY πΘ Var(Y, Y , ˆY ) πΘ Y i | ˆY = ˆY + i=1 ˆk E ˆY πΘ Var(Y, Y , ˆY ) πΘ 1Y =yi | ˆY = ˆY . Combining estimation of A1, A2, A3, A4, and by condition (21) we have A s A1 + A2 + A3 + A4 i=1 (ni + ˆk)Var Y, ˆY (1Y =yi| ˆY = ˆY ) + i=1 (mi + K 1)Var Y, ˆY ( Y i| ˆY = ˆY ) i=1 (ni + ˆk)Var2 Y, ˆY (Y | ˆY = ˆY ) + i=1 (mi + K 1)Var2 Y, ˆY ( Y i| ˆY = ˆY ) i=1 (ni + ˆk) + i=1 (mi + K 1) which further leads to A c I. For the second term B = Cov(Y, Y , ˆY ) πΘ(Φ(Y, Y , ˆY )), B = E(Y, Y , ˆY ) πΘ h (Φ(Y, Y , ˆY ) E(Y, Y , ˆY ) πΘ[Φ(Y, Y , ˆY )])2i = EY, Y , ˆY πΘ Φ(Y, Y , ˆY ) Y , Y , ˆY Φ(Y , Y , ˆY ) exp ΘT Φ(Y , Y , ˆY ) Y , Y , ˆY exp ΘT Φ(Y , Y , ˆY ) = EY, Y , ˆY πΘ Θ log exp ΘT Φ(Y, Y , ˆY ) Θ log Y , Y , ˆY exp ΘT Φ(Y , Y , ˆY ) = E(Y, Y , ˆY ) πΘ Θ log πΘ(Y, Y , ˆY ) 2 , where E(Y, Y , ˆY ) πΘ Θ log πΘ(Y, Y , ˆY ) 2 is the Fisher Information of Θ. By the Cramér-Rao bound and the condition (20), I 2c|D| Cov(ˆΘ) DE(Y, ˆY ) πΘ Θ log πΘ(Y, ˆY ) 2 1 , which further leads to B = E(Y, Y , ˆY ) πΘ Θ log πΘ(Y, Y , ˆY ) 2 2c I. The proof is completed by putting estimation of terms A and B together. Published as a conference paper at ICLR 2022 Lemma D.6. [Extension of Lemma D.5 in (Ratner et al., 2016)] Suppose that conditions (18)-(23) hold. Let ˆW be the learned parameters of the end model produced by Algorithm 1, and ℓ(W ) be the minimum of cross entropy loss function ℓ. Then, we can bound the expected risk with E h ℓ( ˆW) ℓ(W ) i χ + 4c Hϵ. Proof. We begin by rewriting objective of expected loss minimization problem using law of total expectation as follows: ℓ(W) =E(X,Y ) π E(X,Y ) π [H(Y, σ(h(X, W)))|X] =E(X ,Y ) π E(X,Y ) π [H(Y, σ(h(X, W)))|X = X ] =E(X ,Y ) π E(X,Y ) π [H(Y, σ(h(X , W)))|X = X ] and by our conditional independence assumption (condition (19)), we have P(Y |X = X ) = P(Y | ˆY (X) = ˆY (X )), which further leads to ℓ(W) =E(X ,Y ) π h E(X,Y ) π h H(Y, σ(h(X , W))) ˆY (X) = ˆY (X ) ii =E(X ,Y ) π h E(Y, ˆY ) πΘ h H(Y, σ(h(X , W))) ˆY = ˆY (X ) ii On the other hand, if we are minimizing the model with learned parameter ˆΘ, we will be actually minimizing ℓˆΘ(W) = E(X ,Y ) π h E(Y, ˆY ) π ˆ Θ h H(Y, σ(h(X , W))) ˆY = ˆY (X ) ii , where for any X , E(Y, ˆY ) π ˆ Θ h H(Y, σ(h(X , W))) ˆY = ˆY (X ) i can be further calculated as E(Y, ˆY ) π ˆ Θ h H(Y, σ(h(X , W))) ˆY = ˆY (X ) i l=1 log (σ(h(X , W))l) P(Y, ˆY ) π ˆ Θ(Y = yl| ˆY = ˆY (X )). For simplification, we rewrite P(Y, ˆY ) π ˆ Θ(Y = yl| ˆY = ˆY (X )) as follows with slight abuse of notations: P(Y, ˆY ) π ˆ Θ(Y = yl| ˆY = ˆY (X )) = Pπ ˆ Θ(yl| ˆY (X )), and similarly E(X ,Y ) π = Eπ , Let l X = arg minl log (σ(h(X , W))l). 
The difference between the loss functions will be |ℓˆΘ(W) ℓ(W)| = l=1 log σ(h(X , W))l PπΘ (yl| ˆY (X )) Pπ ˆ Θ(yl| ˆY (X )) # = Eπ h log σ(h(X , W))l X PπΘ (yl X | ˆY (X )) Pπ ˆ Θ(yl X | ˆY (X )) i l =l X log σ(h(X , W))l PπΘ (yl| ˆY (X )) Pπ ˆ Θ(yl| ˆY (X )) Published as a conference paper at ICLR 2022 Furthermore, Eπ h log σ(h(X , W))l X PπΘ (yl X | ˆY (X )) Pπ ˆ Θ(yl X | ˆY (X )) i l =l X log (σ(h(X , W))l) PπΘ (yl| ˆY (X )) Pπ ˆ Θ(yl| ˆY (X )) log σ(h(X , W))l X l =l X PπΘ (yl| ˆY (X )) + X j =l X Pπ ˆ Θ(yl| ˆY (X )) l =l X log (σ(h(X , W))l) PπΘ (yl| ˆY (X )) Pπ ˆ Θ(yl| ˆY (X )) log (σ(h(X , W))l) log σ(h(X , W))l X PπΘ (yl| ˆY (X )) Pπ ˆ Θ(yl| ˆY (X )) h(X , W)l h(X , W)l X PπΘ (yl| ˆY (X )). Pπ ˆ Θ(yl| ˆY (X )) Let h(l1, l2) = h(X , W)l1 h(X , W)l2. By Eq. (22), we have for any l [k], 0 h(l, l X ) 2H. For any fixed X , define g X (Θ) as follows: g X (Θ) = X h(l, l X )PπΘ(yl| ˆY (X )), (26) based on which we have ℓˆΘ(W) ℓ(W) Eπ g X (ˆΘ) g X (Θ ) By First Mean Value Theorem, g X (ˆΘ) g X(Θ ) = g X (ξ), ˆΘ Θ ˆΘ Θ g X (ξ) . We then bound g X(ξ) element-wisely: (1). For any i [k], j [n], ˆyj l N(yi, Yj), if i = l X , ˆY j(X ) = ˆyj l , g X (ξ) θAcc yi,ˆyj l ,j h(l, l X ) Pπξ(yl| ˆY (X )) θAcc yi,ˆyj l ,j h(l, l X )Pπξ(yi| ˆY (X ))Pπξ(yl| ˆY (X )) h(l, l X )Pπξ(yi| ˆY (X ))Pπξ(yl| ˆY (X )) 2HPπξ(yi| ˆY (X ))(1 Pπξ(yi| ˆY (X ))) =2HVar h 1Y =yi| ˆY (X ) i . Published as a conference paper at ICLR 2022 If i = l X , ˆY j(X ) = ˆyj l , g X (ξ) θAcc yi,ˆyj l ,j h(l, l X ) Pπξ(yl| ˆY (X )) θAcc yi,ˆyj l ,j l/ {i,l X } h(l, l X )Pπξ(yi| ˆY (X ))Pπξ(yl| ˆY (X )) + h(i, l X ) Pπξ(yi| ˆY (X )) Pπξ(yi| ˆY (X ))Pπξ(yi| ˆY (X )) l/ {i,l X } h(l, l X )Pπξ(yi| ˆY (X ))Pπξ(yl| ˆY (X )), h(i, l X ) Pπξ(yi| ˆY (X )) Pπξ(yi| ˆY (X ))Pπξ(yi| ˆY (X )) o 2HPπξ(yi| ˆY (X ))(1 Pπξ(yi| ˆY (X ))) =2HVar h 1Y =yi| ˆY (X ) i . If ˆY j(X ) = ˆyj l , g X (ξ) θAcc yi,ˆyj l ,j h(l, l X ) Pπξ(yl| ˆY (X )) θAcc yi,ˆyj l ,j (2). For j [n], ˆyr Yj, if ˆY j(X ) = ˆyr, h(l, l X ) Pπξ(yl| ˆY (X )) h(l, l X ) Pπξ(Y = yl, Y r = 1| ˆY (X )) Pπξ(Y = yl| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) . f1(l) = Pπξ(Y = yl, Y r = 1| ˆY (X )) f2(l) = Pπξ(Y = yl| ˆY (X ))Pπξ( Y r = 1| ˆY (X )), B1 = {l : f1(l) f2(l), l = l X }, B2 = {l : f1(l) < f2(l), l = l X }. Published as a conference paper at ICLR 2022 h(l, l X ) Pπξ(Y = yl, Y r = 1| ˆY (X )) Pπξ(Y = yl| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) h(l, l X ) (f1(l) f2(l)) l B1 h(l, l X ) (f1(l) f2(l)) + X l B2 h(l, l X ) (f1(l) f2(l)) l BT h(l, l X ) (f1(l) f2(l)) l B1 h(l, l X ) (f1(l) f2(l)) , X l B2 h(l, l X ) (f2(l) f1(l)) On the other hand, l B1 h(l, l X ) (f1(l) f2(l)) l B1 h(l, l X ) Pπξ(Y = yl, Y r = 1| ˆY (X )) Pπξ(Y = yl| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) Pπξ(Y = yl, Y r = 1| ˆY (X )) Pπξ(Y = yl| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) =2H Pπξ(Y = yl, l B1, Y r = 1| ˆY (X )) Pπξ(Y = yl, l B1| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) =2H Pπξ(Y = yl, l B1, Y r = 1| ˆY (X ))Pπξ( Y r = 0| ˆY (X )) Pπξ(Y = yl, l B1, Y r = 0| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) 2H Pπξ( Y r = 1| ˆY (X ))Pπξ( Y r = 0| ˆY (X )) =2HVar h Y r| ˆY (X ) i . Similarly, we have X l B1 h(l, l X ) (f2(l) f1(l)) X l B2 h(l, l X ) Pπξ(Y = yl, Y r = 1| ˆY (X )) + Pπξ(Y = yl| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) 2HVar h Y r| ˆY (X ) i . Conclusively, we have g X (ξ) 2HVar h Y r| ˆY (X ) i . If ˆY j = ˆyr, similar to (1), we have g X (ξ) Published as a conference paper at ICLR 2022 (3). For any ˆyi, ˆyj ˆY, by the definition of ϕt ˆyi,ˆyj, there exists (a, b) {0, 1}2, such that ϕt ˆyi,ˆyj(a, b) = 0. 
Similar to (2), let f3(l) = Pπξ(Y = yl, Y i = a, Y j = b| ˆY (X )) f4(l) = Pπξ(Y = yl| ˆY (X ))Pπξ( Y i = a, Y j = b| ˆY (X )) B3 = {l : f3(l) f4(l), l = l X }, B4 = {l : f3(l) < f4(l), l = l X } we have g X (ξ) l B3 h(l, l X ) (f3(l) f4(l)) , X l B4 h(l, l X ) (f4(l) f3(l)) 2HVar h ϕt ˆyi,ˆyj( Y i, Y j)| ˆY (X ) i ( ) 2H Var h Y i| ˆY (X ) i + Var h Y j| ˆY (X ) i , where inequality ( ) comes from Lemma D.1. (4). For any yi Y, ˆyr ˆY, by the definition of ϕt yi,ˆyr, there exists a {0, 1}, yj Y, s.t., ϕt yi,ˆyr(yj, a) = 0. We further divide the proof into two cases: ϕt yi,ˆyr(yi, a) = 0, and ϕt yi,ˆyr(yi, a) = 0. (4a). If ϕt yi,ˆyr(yi, a) = 0, we have tyi ˆyr = tsg and consequently a = 1. Similar to (1-3)., we have h(l, l X ) Pπξ(yl| ˆY (X )) l=1 h(l, l X ) Pπξ(yl| ˆY (X )) l =i h(l, l X ) Pπξ(Y = yl, Y r = 1| ˆY (X )) Pπξ(Y = yl| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) h(i, l X )Pπξ(Y = yi| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) , where Eq. ( ) is due to Let f5(l) = Pπξ(Y = yl, Y r = 1| ˆY (X )) f6(l) = Pπξ(Y = yl| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) B5 = {l : f5(l) f6(l), l = i}, B6 = {l : f5(l) < f6(l), l = i} Then we have g X (ξ) l B5 h(l, l X ) (f5(l) f6(l)) , X l B6 h(l, l X ) (f6(l) f5(l)) + h(i, l X )f6(i) Published as a conference paper at ICLR 2022 On one hand, X l B5 h(l, l X ) (f5(l) f6(l)) l B5 h(l, l X ) Pπξ(Y = yl, Y r = 1| ˆY (X )) Pπξ(Y = yl| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) l B5 2H Pπξ(Y = yl, Y r = 1| ˆY (X )) Pπξ(Y = yl| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) =2H Pπξ(Y = yl, l B5, Y r = 1| ˆY (X )) Pπξ(Y = yl, l B5| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) =2H Pπξ(Y = yl, l B5, Y r = 1| ˆY (X ))(1 Pπξ(Y = yi, Y r = 1| ˆY (X ))) Pπξ(Y = yl, l B5, Y r = 0| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) 2HPπξ(Y = yi, Y r = 1| ˆY (X ))(1 Pπξ(Y = yi, Y r = 1| ˆY (X ))) =2HVarπξ h ϕt yi,ˆyl(Y, Y r)| ˆY (X ) i 2HVarπξ h 1Y =yi| ˆY (X ) i + 2HVarπξ h Y r| ˆY (X ) i . On the other hand, X l B6 h(l, l X ) (f6(l) f5(l)) + h(i, l X )f6(i) l B6 h(l, l X ) Pπξ(Y = yl, Y r = 1| ˆY (X )) + Pπξ(Y = yl| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) + h(i, l X )Pπξ(Y = yi| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) 2H Pπξ(Y = yl, l B6, Y r = 1| ˆY (X ))(1 Pπξ(Y = yi, Y r = 1| ˆY (X ))) Pπξ(Y = yl, l B6, Y r = 0| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) + Pπξ(Y = yi| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) 2H Pπξ(Y = yl, l B6, Y r = 0| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) + Pπξ(Y = yi| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) 2H Pπξ( Y r = 0| ˆY (X ))Pπξ( Y r = 1| ˆY (X )) + Pπξ(Y = yi| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) 2HVarπξ h 1Y =yi| ˆY (X ) i + 2HVarπξ h Y r| ˆY (X ) i . Therefore, in this case, we have g X (ξ) 2HVarπξ h 1Y =yi| ˆY (X ) i + 2HVarπξ h Y r| ˆY (X ) i . (4b). If ϕt yi,ˆyr(yi, a) = 0, similar to (4a)., we have l =i h(l, l X )Pπξ(Y = yl| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) + h(i, l X ) Pπξ(Y = yi, Y l = 1| ˆY (X )) Pπξ(Y = yi| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) . Published as a conference paper at ICLR 2022 h(i, l X ) Pπξ(Y = yi, Y r = 1| ˆY (X )) Pπξ(Y = yi| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) = h(i, l X )Pπξ(Y = yi, Y r = 1| ˆY (X ))Pπξ(Y = yi| ˆY (X )) we have g X (ξ) max n h(i, l X )Pπξ(Y = yi, Y r = 1| ˆY (X ))Pπξ(Y = yi| ˆY (X )) , l =i h(l, l X )Pπξ(Y = yl| ˆY (X ))Pπξ(Y = yi, Y r = 1| ˆY (X )) 2HVar h 1Y =yi| ˆY (X ) i . Combining (4a). and (4b)., we have that g X (ξ) 2H Varπξ h 1Y =yi| ˆY (X ) i + Varπξ h Y l| ˆY (X ) i . 
Combining (1)-(4), we then have

∥∇g_{X'}(ξ)∥² ≤ Σ_{i∈[k]} Σ_{j=1}^{n} (|N(y_i, Y_j)| - 1) Var_{πξ}[1_{Y=y_i} | Ŷ(X')]²  (27)
 + Σ_{j∈[n], ŷ_r∈Y_j} Var_{πξ}[Ȳ_r | Ŷ(X')]²
 + Σ_{1≤i<j≤k̂} (Var_{πξ}[Ȳ_i | Ŷ(X')] + Var_{πξ}[Ȳ_j | Ŷ(X')])²
 + Σ_{i∈[k], j∈[k̂]} (Var_{πξ}[1_{Y=y_i} | Ŷ(X')] + Var_{πξ}[Ȳ_j | Ŷ(X')])²
 ≤ Σ_{i=1}^{k} (n_i + k̂) Var_{πξ}(1_{Y=y_i} | Ŷ = Ŷ')² + Σ_{i=1}^{k̂} (m_i + K - 1) Var_{πξ}(Ȳ_i | Ŷ = Ŷ')². (28)

Therefore, by Eqs. (25), (26), and (28), and Assumption Eq. (21), we have

|ℓ(W) - ℓ_Θ̂(W)| = | E_{π*} [ Σ_l h(l, l_{X'}) ( P_{π_{Θ*}}(Y = y_l | Ŷ(X')) - P_{π_Θ̂}(Y = y_l | Ŷ(X')) ) ] |
 = | E_{π*} [ g_{X'}(Θ*) - g_{X'}(Θ̂) ] |
 ≤ E_{π*} [ ∥∇g_{X'}(ξ)∥ ∥Θ* - Θ̂∥ ].

Now, we apply the assumption that we are able to solve the empirical problem, producing an estimate Ŵ that satisfies

E[ ℓ_Θ̂(Ŵ) - ℓ_Θ̂(W*_Θ̂) ] ≤ χ,

where W*_Θ̂ is the true solution to W*_Θ̂ = argmin_W ℓ_Θ̂(W). Then

E[ ℓ(Ŵ) - ℓ(W*) ] = E[ ℓ_Θ̂(Ŵ) - ℓ_Θ̂(W*_Θ̂) + ℓ_Θ̂(W*_Θ̂) - ℓ_Θ̂(Ŵ) + ℓ(Ŵ) - ℓ(W*) ]
 ≤(*) χ + E[ ℓ_Θ̂(W*_Θ̂) - ℓ_Θ̂(Ŵ) + ℓ(Ŵ) - ℓ(W*) ]
 ≤ χ + √M E∥Θ̂ - Θ*∥ + E[ ℓ_Θ̂(W*_Θ̂) - ℓ_Θ̂(Ŵ) + ℓ_Θ̂(Ŵ) - ℓ_Θ̂(W*) ],

where inequality (*) comes from condition (23). With Eqs. (20) and (21), we have Eq. (24) by Lemma D.5, i.e.,

(E∥Θ̂ - Θ*∥)² ≤ E∥Θ̂ - Θ*∥² ≤ ε²M.

Combining the bounds above results in

E[ ℓ(Ŵ) - ℓ(W*) ] ≤ χ + 4cHϵ.

The proof is completed.

E EXAMPLES AND ILLUSTRATIONS

E.1 LABEL GRAPH AND LABEL HIERARCHY

Fig. 5 shows the mapping between a label hierarchy and the corresponding label graph. Indeed, given the order of labels, any label structure represented as a directed acyclic graph (DAG) can be converted to exactly one consistent label graph based on the four types of label relations.

[Figure 5: The illustration of the mapping between a DAG of labels (e.g., Canidae, Domestic, Husky) and a label graph with Exclusive, Overlap, Subsuming, and Subsumed edges.]

E.2 AN EXAMPLE OF INCONSISTENT LABEL GRAPH

Fig. 6 shows an example of an inconsistent label graph. The label graph is unrealistic and ambiguous because Husky subsumes Canidae, but (1) Canidae subsumes Dog and (2) Dog subsumes Husky together imply that Husky should be subsumed by Canidae. The example also shows that a label graph induced from a cyclic label hierarchy must be inconsistent.

[Figure 6: The illustration of an inconsistent label graph.]

E.3 ENUMERATION OF INCONSISTENT TRIANGLE LABEL GRAPHS

For a triangle label graph G, we list all inconsistent label relation structures. The consistency of a larger label graph with more labels can be verified by checking the consistency of every triangle inside it. One example proof, for {Exclusive, Overlap, Subsuming}, can be found in Lemma 1.

Table 5: Enumeration of Inconsistent Label Relation Triplets.

t_ab | t_bc | t_ac
Overlap | Subsumed | Subsuming
Overlap | Subsumed | Exclusive
Overlap | Subsuming | Subsumed
Overlap | Exclusive | Subsumed
Exclusive | Subsumed | Subsuming
Exclusive | Overlap | Subsuming
Exclusive | Subsuming | Subsuming
Exclusive | Subsuming | Subsumed
Exclusive | Subsuming | Overlap
Subsuming | Exclusive | Subsumed
Subsuming | Subsumed | Exclusive
Subsuming | Overlap | Subsumed
Subsuming | Overlap | Exclusive
Subsuming | Subsuming | Exclusive
Subsuming | Subsuming | Subsumed
Subsuming | Subsuming | Overlap
Subsumed | Overlap | Subsuming
Subsumed | Subsumed | Exclusive
Subsumed | Subsumed | Subsuming
Subsumed | Subsumed | Overlap
Subsumed | Exclusive | Subsuming
Subsumed | Exclusive | Subsumed
Subsumed | Exclusive | Overlap

E.4 AN EXAMPLE OF INDISTINGUISHABLE LABEL GRAPH

Fig. 7 shows an example label graph with an indistinguishable label relation structure. Again, red labels represent desired unseen labels, while gray labels are undesired and seen.
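To make the indistinguishability check concrete, here is a small sketch (ours, not part of the paper's code release) that applies the symmetry condition of Definition 4 at the level of label relations: two unseen labels are indistinguishable when they share the same relation with every seen label. The label names and the relations dictionary mirror the Fig. 7/Fig. 8 example; everything else is a hypothetical illustration.

```python
# Hypothetical sketch of the distinguishability sanity test: two unseen labels
# are indistinguishable iff they share the same relation to every seen label.

unseen = ["Husky", "Bulldog", "Bengal Cat"]
seen = ["Dog", "Cat", "Domestic Animals"]

# Relation of (unseen, seen) pairs; anything not listed is treated as exclusive.
relations = {
    ("Husky", "Dog"): "subsumed",
    ("Bulldog", "Dog"): "subsumed",
    ("Bengal Cat", "Cat"): "subsumed",
    ("Husky", "Domestic Animals"): "subsumed",
    ("Bulldog", "Domestic Animals"): "subsumed",
    ("Bengal Cat", "Domestic Animals"): "subsumed",
}

def relation_vector(u):
    return tuple(relations.get((u, s), "exclusive") for s in seen)

def distinguishable(u1, u2):
    # Symmetric dependency structure <=> identical relation vectors.
    return relation_vector(u1) != relation_vector(u2)

print(distinguishable("Husky", "Bulldog"))     # False: same relations to all seen labels
print(distinguishable("Husky", "Bengal Cat"))  # True: they differ on "Dog" and "Cat"

# Adding an ILF that predicts "Arctic Animals" (which subsumes Husky but not
# Bulldog) breaks the symmetry, as in Fig. 8.
seen.append("Arctic Animals")
relations[("Husky", "Arctic Animals")] = "subsumed"
print(distinguishable("Husky", "Bulldog"))     # True
```

This is only the label-relation half of the test; the corresponding model-level statement is Theorem 3.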
We can see that the unseen labels Husky and Bulldog have indistinguishable label relation structures because their relations to all seen labels are equal. For example, the seen label Dog subsumes both Husky and Bulldog. In contrast, for Husky and Bengal Cat, the seen label Cat subsumes the latter but is exclusive to the former, which indicates that Husky and Bengal Cat have distinguishable label relation structures. Note that Bengal Cat and Persian Cat also have indistinguishable label relation structures, but the former is an unseen desired label while the latter is seen and can be predicted by some ILF(s); we are only interested in the distinguishability of pairs of unseen labels.

In practice, users could "break the symmetry" by adding new ILFs with new labels. For example, if we add an ILF that can predict Arctic Animals, then the new seen label Arctic Animals is added to the label graph as shown in Fig. 8. Since Arctic Animals subsumes Husky but not Bulldog, the indistinguishable label relation structure of Husky and Bulldog is broken successfully.

[Figure 7: An example of an indistinguishable label relation structure (Husky and Bulldog).]

[Figure 8: An example of fixing an indistinguishable label relation structure (Husky and Bulldog) by adding a new label (Arctic Animals).]

F EXPERIMENTAL DETAILS

F.1 DATASET

Large-scale Text Classification Dataset³: LSHTC-3 (Partalas et al., 2015), a large-scale hierarchical text classification dataset consisting of 456,886 documents and 36,504 categories organized in a label hierarchy. We filter out documents with multiple labels and preserve categories with more than 500 documents. We use a pre-trained sentence transformer (Reimers & Gurevych, 2019) to obtain document embeddings for classification. We follow Zhang et al. (2021) to generate 5 keyword-based labeling functions for each seen label as ILFs.

Large-scale Image Classification Dataset⁴: ILSVRC2012 (Russakovsky et al., 2015), a large-scale image classification dataset consisting of 1.2M training images from 1000 object classes based on ImageNet. Following Deng et al. (2014), we use WordNet as the label hierarchy; because all images are assigned to leaf labels in WordNet, for each non-leaf label we aggregate the images belonging to its descendants as its data points (Deng et al., 2014). To create weak supervision sources, we follow Mazzetto et al. (2021b;a) and train 10 image classifiers as ILFs: we randomly sample 2 or 3 exclusive seen labels from the label graph, together with 500 images for each label, to train a ResNet-32 classifier.

³ http://lshtc.iit.demokritos.gr/
⁴ http://image-net.org/challenges/LSVRC/2012/index#data

F.2 DESCRIPTION OF APPLYING DAP

To apply DAP, we use both label relations and ILFs to construct attributes for both unseen classes and unlabeled data points. Then we train the attribute classifiers, which in turn are used to predict unseen labels on the test set, as in Lampert et al. (2013). To construct attributes for unseen labels and data points, we leverage the outputs of ILFs and the label relations. First, based on the label relations and basic logical rules, we enumerate all possible assignments of seen labels for a data point; a minimal sketch of this enumeration step is given below, and the rules it encodes are spelled out in the example that follows.
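The following is a hypothetical sketch of that enumeration step, not the paper's implementation: it generates all subsets of seen labels that are consistent with the pairwise relations (subsumption forces co-membership, exclusion forbids it). The helper names and the relation encoding are our own illustration.

```python
# Hypothetical sketch: enumerate label assignments s of the seen labels that are
# consistent with the pairwise label relations (used to define DAP attributes).
from itertools import product

seen = ["A", "B", "C"]

# Relation of an ordered pair (x, y); "subsumed" means x is subsumed by y.
relations = {
    ("A", "B"): "subsumed",   # A is a subset of B: membership in A implies membership in B
    ("A", "C"): "exclusive",  # A and C cannot co-occur
}

def rel(x, y):
    if (x, y) in relations:
        return relations[(x, y)]
    flipped = {"subsumed": "subsuming", "subsuming": "subsumed"}
    r = relations.get((y, x), "overlap")
    return flipped.get(r, r)

def consistent(assignment):
    """assignment maps each seen label to 0/1."""
    for x in seen:
        for y in seen:
            if x == y:
                continue
            r = rel(x, y)
            if r == "subsumed" and assignment[x] == 1 and assignment[y] == 0:
                return False  # x subsumed by y, so x=1 forces y=1
            if r == "exclusive" and assignment[x] == 1 and assignment[y] == 1:
                return False  # x and y cannot both hold
    return True

S = [dict(zip(seen, bits)) for bits in product([0, 1], repeat=len(seen))
     if consistent(dict(zip(seen, bits)))]
for s in S:
    print(s)
```

Each surviving assignment s then indexes one attribute dimension, as described next.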
For example, if label A is subsumed by label B, then when a data point belongs to label A it must also belong to B; and if labels A and B are exclusive, then a data point cannot belong to both at the same time. Let s ∈ S denote one possible label assignment, where S is the set of all possible assignments. We then define the attribute as a vector of |S| dimensions, where each dimension corresponds to one s.

Second, we define the attributes of unseen labels. For an unseen label A and a label assignment s, if A is not exclusive to any label in s, then we set the corresponding attribute a_s = 1 for label A; otherwise a_s = 0. The intuition is that, if A is not exclusive to the labels in s, it is likely that when a data point matches assignment s, it also belongs to label A. For each data point, we use the labels assigned by the ILFs to build its attributes: if a data point matches assignment s, then its corresponding attribute a_s = 1, otherwise 0. Then we can train an attribute classifier p(a_s|x) for each attribute based on the data point attributes. During inference, we use the unseen label attributes as well as the attribute classifiers as in Lampert et al. (2013):

f(x) = argmax_c Π_m p(a_m^c | x) / p(a_m^c). (29)

F.3 HYPER-PARAMETERS

For the training of PGMs, we set the learning rate to 1/n, where n is the number of training data points. For training the logistic regression model, we use the default parameters of the scikit-learn library. For training the ResNet model, we set the batch size to 256 and use the Adam optimizer with a learning rate of 1e-3 and a weight decay of 5e-5.

F.4 HARDWARE AND IMPLEMENTATION DETAILS

All experiments ran on a machine with an Intel(R) Xeon(R) CPU E5-2678 v3, 512GB of memory, and a GeForce GTX 1080Ti (11GB) GPU. All code was implemented in Python. We use the standard implementation of the logistic regression model from the scikit-learn library⁵ and the ResNet model from the torchvision library⁶. Our code will be released upon acceptance.

⁵ https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
⁶ https://pytorch.org/docs/stable/torchvision/models.html

F.5 DATASET DETAILS OF REAL-WORLD APPLICATIONS

We list the tags used in the real-world application (Sec. 7.3) and examples of the label relations we query from the existing product category taxonomy.

Table 6: The tags and examples of label relations of the Car Accessories category.
New unseen tags: Performance Modifying Parts, Vehicle Tires & Tire Parts, Car Engines & Engine Parts
Existing tags: Car Modification Parts, Car Parts & Accessories, Car & Truck Tires, Replacement Car Parts, Car & Truck Wheels
Label relation examples: Replacement Car Parts subsumes Car Engines & Engine Parts; Car & Truck Tires is subsumed by Vehicle Tires & Tire Parts

Table 7: The tags and examples of label relations of the Furniture Accessories category.
New unseen tags: Clothing & Shoe Storage, Living Room Furniture, Beds & Headboards
Existing tags: Coffee Tables & End Tables, Entertainment & Media Centers, Bedroom Furniture, Sofas & Chairs, Mattresses
Label relation examples: Bedroom Furniture subsumes Beds & Headboards; Sofas & Chairs is subsumed by Living Room Furniture

G ADDITIONAL EXPERIMENTS

G.1 PERFORMANCE DROP WHEN THE DISTINGUISHABLE CONDITION IS VIOLATED

To validate the effectiveness of the distinguishable condition, we derive another 100 WIS tasks from the LSHTC-3 dataset, where each task has at least one pair of unseen labels sharing exactly the same label relation structure; a sketch of how such tasks can be selected is given below.
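A hypothetical sketch of that selection step: starting from a subsumption DAG (a stand-in for the LSHTC-3 hierarchy), sample candidate sets of unseen labels and keep those containing at least one pair whose relations to every seen label coincide. The toy graph, the sampling sizes, and the helper names are illustrative assumptions, not the authors' pipeline.

```python
# Hypothetical sketch: sample WIS tasks that violate the distinguishable
# condition, i.e., contain two unseen labels with identical relations to all
# seen labels. The toy hierarchy and sizes below are illustrative only.
import random
from itertools import combinations

# parent -> children edges of a small subsumption DAG (stand-in for LSHTC-3).
children = {
    "root": ["animals", "vehicles"],
    "animals": ["dog", "cat", "horse"],
    "dog": ["husky", "bulldog"],
    "vehicles": ["car", "bike"],
}
labels = set(children) | {c for cs in children.values() for c in cs}

def ancestors(label):
    return {p for p, cs in children.items() if label in cs} | \
           {a for p, cs in children.items() if label in cs for a in ancestors(p)}

def relation(u, s):
    # Relation of unseen label u to seen label s, derived from the DAG.
    if s in ancestors(u):
        return "subsumed"
    if u in ancestors(s):
        return "subsuming"
    return "exclusive"  # ignoring overlap for this toy tree

def violates(unseen, seen):
    vec = {u: tuple(relation(u, s) for s in seen) for u in unseen}
    return any(vec[a] == vec[b] for a, b in combinations(unseen, 2))

random.seed(0)
tasks = []
while len(tasks) < 5:  # the paper uses 100 such tasks
    unseen = random.sample(sorted(labels - {"root"}), 3)
    seen = sorted(labels - set(unseen) - {"root"})
    if violates(unseen, seen):
        tasks.append((unseen, seen))
print(tasks[0])
```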
In Table 8, we report the performance drop in the averaged evaluation results over these 100 WIS tasks, compared with the numbers in Table 2. Although the two sets of WIS tasks are different and therefore not individually comparable, the averaged performance drop does indicate that violating the distinguishable condition results in undesirable synthesized training labels, which implicitly demonstrates the effectiveness of the distinguishable condition.

Table 8: Performance drop on averaged evaluation results over 100 WIS tasks derived from LSHTC-3 when the distinguishable condition is violated.

Method | Accuracy | F1-score
Label Model
  LR-MV | -11.49 | -13.83
  W-LR-MV | -11.51 | -13.47
  WS-LG | -9.28 | -8.63
  PLRM | -9.66 | -9.63
End Model
  LR-MV | -16.14 | -17.08
  W-LR-MV | -15.27 | -15.97
  WS-LG | -13.13 | -13.78
  PLRM | -13.39 | -14.09