Learnability with Indirect Supervision Signals

Kaifu Wang, University of Pennsylvania, kaifu@sas.upenn.edu
Qiang Ning, Amazon, qning@amazon.com
Dan Roth, University of Pennsylvania, danroth@seas.upenn.edu

Abstract

Learning from indirect supervision signals is important in real-world AI applications when, often, gold labels are missing or too costly. In this paper, we develop a unified theoretical framework for multi-class classification when the supervision is provided by a variable that contains nonzero mutual information with the gold label. The nature of this problem is determined by (i) the transition probability from the gold labels to the indirect supervision variables and (ii) the learner's prior knowledge about the transition. Our framework relaxes assumptions made in the literature, and supports learning with unknown, non-invertible and instance-dependent transitions. Our theory introduces a novel concept called separation, which characterizes the learnability and generalization bounds. We also demonstrate the application of our framework via concrete novel results in a variety of learning scenarios such as learning with superset annotations and joint supervision signals.

1 Introduction

We are interested in the problem of multiclass classification where direct, gold annotations for the unlabeled instances are expensive or inaccessible, and instead the observation of a variable that depends on the true label is used as the supervision signal. Examples include learning from noisy annotations [1, 21, 26], partial annotations [16, 22, 14] or feedback from an external world [15, 8]. To extract the information contained in a dependent variable, the learner should have certain prior knowledge about the relation between the true label and the supervision signal, which can be expressed in various forms. For example, in the noisy label problem, the noise rate is assumed to be bounded by a constant (such as the Massart noise [23, 17]). In the superset problem, the true label is commonly assumed to be contained in (or consistent with) the superset annotation [16, 22].

As in [13, 29, 36], we model the aforementioned relation using a transition probability, which is the distribution of the observable variable conditioned on the label and the instance. The transition enables the learner to induce a prediction of the observable via the prediction of the label, and to construct loss functions based on the induced prediction and the observable. In this paper, instead of assuming that the learner fully knows the transition, we formalize the concept of a transition class, a set that contains all the candidate transitions, to describe more general forms of prior information. Also, we define the concept of separation to quantify whether the information is enough to distinguish different labels. With these concepts, we are able to study a variety of learning scenarios with unknown, non-invertible and instance-dependent transitions in a unified way. We do this under the realizability assumption (also called separability in linear classification), a commonly made assumption (as in [2, 19, 22]) which states that the true classifier is in the hypothesis space. Our goal is to develop a unified theoretical framework that can (i) provide learnability conditions for general indirect supervision problems, (ii) describe what prior knowledge is needed about the transition, and (iii) characterize the difficulty of learning with indirect supervision.
Work done while at the Allen Institute for AI and at the University of Illinois at Urbana-Champaign.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Specifically, in this paper, our main contributions include:

1. We decompose the learnability condition of a general indirect supervision problem into three aspects: complexity, consistency and identifiability, and provide a unified learning bound for the problem (Theorem 4.2).
2. We propose a simple yet powerful concept called separation, which encodes the prior knowledge about the transition using a statistical distance between distributions over the annotation space, and use it to characterize consistency and identifiability (Theorem 5.2).
3. We formalize two ways to achieve separation: total variation and joint supervision, and use them to derive concrete novel results for practical learning problems of interest (Sections 5.2 and 5.3).

All proofs of the theoretical results are presented in the supplementary material.

2 Related Work

Specific Indirect Supervision Problems. Our work is motivated by many previous studies on the problem of learning in the absence of gold labels. In particular, the problem of classification under label noise dates back to [1] and has been studied extensively over the past decades. Our work is mostly related to (i) theoretical analyses of PAC guarantees and consistency of loss functions, including learning with bounded noise [23, 21, 2] and instance-dependent noise [30, 24, 12], and (ii) algorithms for learning from noisy labels, including using the inverse information of the transition [26, 36] and inducing predictions of the noisy label (which is closer to our formulation) [9, 34]. Superset (also called partial label) problems, where the annotation is given as a subset of the label space, arise in various forms in standard multiclass classification and structured prediction [16, 14, 20, 27]. While it is possible to extend some approaches from the theory of noisy problems to the superset case, the superset problem focuses on the case of a large and complex annotation space, where some of the assumptions (such as a "known transition") would be too strong in practice. On the theoretical side, [16] defines the ambiguity degree to characterize the learning bound. [22] provides an insightful discussion of the PAC-learnability of the superset problem and proposes the concept of an induced hypothesis. These two papers motivate the approach pursued in this paper.

Frameworks for Indirect Supervision. Our supervision scheme is conceptually similar to [33, 29], which model the label as a latent variable of the indirect supervision signal; however, their discussion is restricted to the exponential family model. [10, 11] also propose a framework and algorithms to supervise a structured learning problem with an indirect supervision signal, which is modeled as a binary random variable associated with the gold label. Our work extends the binary indirect signal to a multiclass signal and gives a theoretical treatment of this general learning problem. [13, 14] study the problem of designing consistent loss functions for superset problems when the transition (aka mixing) matrix is partially known. Their discussion can be applied to a wider range of problems such as noisy and semi-supervised learning.
Our framework can also be compared to the multitask learning framework proposed in [5, 6], which defines a notion called F-relatedness to describe the relation between machine learning tasks through some deterministic functions within the instance space 𝒳. In contrast, our framework studies probabilistic transitions between different domains (from the label space 𝒴 to the annotation space 𝒪), and our gold label Y is not observable, which differs from a multitask setting. Our goal is most closely related to [36], which builds on ideas from [32, 14, 26] and develops a general framework for learning from data with reconstructible corruption, using the inverse of a known, instance-independent transition matrix to construct an unbiased estimator of the classification loss and derive generalization bounds. Our study aims to relax these assumptions on the transition matrix, especially when the label space and/or annotation space is large and estimating the transition matrix could be difficult.

3 Preliminaries

We will use P(·) to denote probability, E[·] to denote expectation, 1{·} to denote the indicator function and p(·) to denote the density function or, more generally, the Radon–Nikodym derivative. We denote the source variable as X, which takes values in an input space 𝒳, and the target label as Y, which takes values in a label space 𝒴. We assume |𝒴| = c is finite and identify the elements of 𝒴 as {y_1, y_2, ..., y_c}. The goal is to learn a mapping h_0 : 𝒳 → 𝒴. The hypothesis class ℋ contains candidate mappings h : 𝒳 → 𝒴. The loss function for hypothesis h and sample (x, y) is denoted as ℓ(h(x), y). The risk of a hypothesis h ∈ ℋ is defined as R(h) := E_{X,Y}[ℓ(h(X), Y)], where X is sampled from an (unknown) distribution 𝒟_X. We focus on the realizable case, i.e., there is a classifier h_0 ∈ ℋ such that R(h_0) = 0. As in standard PAC-learning theory, we use the zero-one loss for the gold sample (x, y) (although we may not observe y): ℓ(h(x), y) = 1{h(x) ≠ y}.

An annotation O (also called a supervision signal) is a random variable that is not independent of Y (or equivalently, O and Y have positive mutual information). Dependence between X and O conditioned on Y is allowed but not required. O takes values in an annotation space denoted as 𝒪. We also assume |𝒪| = s < ∞ and identify the elements of 𝒪 as {o_1, o_2, ..., o_s}. For convenience, when using y_i and o_i as subscripts, we regard y_i, o_i as the index i; for example, for any indexed quantity a_i, we will denote a_{y_i} as a_i. We denote the probability simplex of dimension s as 𝒟_O := {w ∈ R^s : Σ_{i=1}^s w_i = 1, w_i ≥ 0}, which represents the set of all distributions over 𝒪.

Examples of the annotation O: (i) In the noisy label problem, the true label is replaced (due to mislabeling or corruption) by another label with certain probabilities; therefore 𝒪 = 𝒴. (ii) In the superset annotation problem, the learner observes o, which is a subset of 𝒴 (hopefully, but not necessarily, o contains the true label y); in this case 𝒪 = 2^𝒴, the power set of 𝒴.

In our framework, the learner predicts O using the graphical model shown in Figure 1. The conditional distribution of O given X = x and Y = y can be identified with a mapping from x to a transition matrix T_0(x) := [P(O = o_j | X = x, Y = y_i)]_{ij}. A key point of this paper is that in general we do not assume that the learner (or learning algorithm) has full information about T_0(x).
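To make the two running examples concrete, the following sketch builds instance-independent transition matrices T_0 for a 3-class problem, one for the noisy-label case (𝒪 = 𝒴) and one for the superset case (𝒪 = 2^𝒴), and samples an annotation o given a gold label y. The specific noise and superset probabilities are hypothetical, chosen only for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
labels = ["y1", "y2", "y3"]          # the label space Y, c = 3

# (i) Noisy-label case: O = Y, so T0 is a c x c row-stochastic matrix.
# Row i gives P(O = . | Y = y_i); here each label is kept with probability 0.8.
T0_noisy = np.full((3, 3), 0.1)
np.fill_diagonal(T0_noisy, 0.8)

# (ii) Superset case: O = 2^Y, so T0 is c x 2^c. Here the (hypothetical) annotator
# reports {y} with probability 0.6 and {y, y'} with a random distractor otherwise.
subsets = [frozenset(s) for r in range(4) for s in itertools.combinations(labels, r)]
T0_superset = np.zeros((3, len(subsets)))
for i, y in enumerate(labels):
    T0_superset[i, subsets.index(frozenset({y}))] = 0.6
    for other in labels:
        if other != y:
            T0_superset[i, subsets.index(frozenset({y, other}))] = 0.2

def sample_annotation(T0, space, y_index):
    """Draw o ~ P(O | Y = y_i) for an instance-independent transition."""
    return space[rng.choice(len(space), p=T0[y_index])]

print(sample_annotation(T0_noisy, labels, y_index=0))
print(sample_annotation(T0_superset, subsets, y_index=0))
```

In both cases the rows of T_0 are distributions over the annotation space, which is the only structural requirement used later.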
Instead, we define a transition hypothesis to be a candidate transition T(x) (also denoted as T for convenience) that maps the instance x to a row-stochastic matrix of size c × s. For a fixed x, the i-th row of a transition T(x) represents a distribution over 𝒪 and is denoted as (T(x))_i. The set of all candidate transition hypotheses is called the transition class, denoted as 𝒯. We assume T_0 ∈ 𝒯. When it is necessary to distinguish transition hypotheses from classifiers in ℋ, we will call the latter base hypotheses. With a transition hypothesis T, a base hypothesis ŷ = h(x) naturally induces a probability distribution (T(x))_{h(x)}. We call it the induced hypothesis, denoted as T ∘ h.

Figure 1: Supervision model. The learner predicts the label Ŷ of the source X via ℋ. To supervise using the observation of (X, O), the learner uses Ŷ to induce a probabilistic prediction P(O | X, Ŷ) over the annotation space via 𝒯.

One may penalize T ∘ h by evaluating its prediction of O on the dataset. More precisely, in our framework, the learner is provided with an annotation loss ℓ_O(ŷ, T, (x, o)) : 𝒴 × 𝒯 × 𝒳 × 𝒪 → R. A natural example is the cross-entropy loss, which approximates the conditional probability of O:

    ℓ_O(h(x), T, (x, o)) := −log P(o | x, h(x), T)    (1)

The annotation risk is defined as R_O(T ∘ h) := E_{X,O}[ℓ_O(h(X), T, (X, O))]. A training set S = {(x^(i), o^(i))}_{i=1}^m contains independent samples of (X, O). The empirical annotation risk associated with the training sample S is then defined as R̂_O(T ∘ h | S) := (1/m) Σ_{i=1}^m ℓ_O(h(x^(i)), T, (x^(i), o^(i))).

In summary, the learner's input includes: the spaces 𝒳, 𝒴, 𝒪; the hypothesis class ℋ and transition class 𝒯; the training set S = {(x^(i), o^(i))}_{i=1}^m; and the loss functions ℓ, ℓ_O. A hypothesis class ℋ is said to be 𝒯-learnable if there is a learning algorithm A : ∪_{m=1}^∞ (𝒳 × 𝒪)^m → ℋ such that for any distribution 𝒟_X over 𝒳 and any T_0 ∈ 𝒯, when running A on datasets S^(m) of m independent samples of (X, O), R(A(S^(m))) converges to 0 in probability as m → ∞. In particular, we define the Empirical Risk Minimizer to be a mapping ERM : ∪_{m=1}^∞ (𝒳 × 𝒪)^m → ℋ such that ERM(S) ∈ argmin_{h∈ℋ, T∈𝒯} R̂_O(T ∘ h | S), where the argmin operator only returns the base hypothesis (although the empirical risk is minimized over both base and transition hypotheses).
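To make the learning protocol concrete, the following sketch implements the cross-entropy annotation loss (1) and the joint empirical risk minimization over base and transition hypotheses for finite classes. The brute-force search and the toy data are illustrative assumptions only, not the algorithmic recipe of the paper.

```python
import numpy as np

def annotation_loss(y_hat, T, o, eps=1e-12):
    """Cross-entropy annotation loss (1): -log P(o | x, y_hat, T)
    for an instance-independent transition matrix T (rows indexed by Y)."""
    return -np.log(T[y_hat, o] + eps)

def empirical_annotation_risk(h, T, samples):
    """Empirical annotation risk: average loss over S = {(x, o)}."""
    return np.mean([annotation_loss(h(x), T, o) for x, o in samples])

def erm(hypotheses, transition_class, samples):
    """ERM(S): minimize the empirical annotation risk jointly over (h, T)
    and return only the base hypothesis h (brute force over finite classes)."""
    best = min(((h, T) for h in hypotheses for T in transition_class),
               key=lambda pair: empirical_annotation_risk(pair[0], pair[1], samples))
    return best[0]

# Toy usage: X = {0, 1}, Y = O = {0, 1}, two candidate classifiers and two
# candidate instance-independent transitions with an unknown noise rate.
samples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]   # observed (x, o) pairs
hypotheses = [lambda x: x, lambda x: 1 - x]
transition_class = [np.array([[0.9, 0.1], [0.1, 0.9]]),
                    np.array([[0.7, 0.3], [0.3, 0.7]])]
h_star = erm(hypotheses, transition_class, samples)
print([h_star(x) for x in (0, 1)])
```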
4 General Learnability Conditions

In this section, we present Theorem 4.2, which decomposes the learnability of a general indirect supervision problem into three aspects: complexity, consistency and identifiability. After that, we provide Proposition 4.3 to help verify the complexity condition. The other two conditions will be further studied in the next section. We assume ℓ_O takes values in an interval [0, b] for some constant b > 0.

To characterize the learnability, a key step is to describe the complexity of the function class

    ℓ_O ∘ 𝒯 ∘ ℋ := {(x, o) ↦ ℓ_O(h(x), T, (x, o)) : h ∈ ℋ, T ∈ 𝒯}

To do so, we use the following generalized version of the VC-dimension proposed in [3]. It will be shown in Proposition 4.3 that it enables us to bound the Rademacher complexity [4] of ℓ_O ∘ 𝒯 ∘ ℋ (which provides the flexibility to study arbitrary loss functions) via the standard VC-dimension or Natarajan dimension [25] of ℋ (which is in general easier to compute than the Rademacher complexity); see Chapter 29 of [7] for an introduction.

Definition 4.1. We adopt the following definitions from [3]:

1. (Shattering and VC-class) A class 𝒞 of subsets of a set 𝒵 is said to shatter a finite subset Z' ⊆ 𝒵 if {C ∩ Z' : C ∈ 𝒞} = 2^{Z'}. Moreover, 𝒞 is called a VC-class with dimension no larger than k if 𝒞 cannot shatter any subset of 𝒵 with more than k elements.

2. (Weak VC-major) The function class ℓ_O ∘ 𝒯 ∘ ℋ is said to be weak VC-major with dimension d if d is the smallest integer such that for all u ∈ R, the set family

    C_u := {{(x, o) : ℓ_O(h(x), T, (x, o)) > u} : h ∈ ℋ, T ∈ 𝒯}

is a VC-class of 𝒳 × 𝒪 with dimension no larger than d.

Now we are able to state the main result of this section.

Theorem 4.2. Suppose the following conditions are satisfied:

[C1] (Complexity) ℓ_O ∘ 𝒯 ∘ ℋ is weak VC-major with dimension d < ∞.

[C2] (Consistency) h_0 ∈ argmin_{h∈ℋ, T∈𝒯} R_O(T ∘ h), where the argmin operator only returns the base hypothesis.

[C3] (Identifiability) η := inf_{h∈ℋ, T∈𝒯 : R(h)>0} R_O(T ∘ h) − inf_{T∈𝒯} R_O(T ∘ h_0) > 0.

Then ℋ is 𝒯-learnable. That is, for any δ ∈ (0, 1), with probability at least 1 − δ,

    R(ERM(S^(m))) ≤ (2b/η) · (√(Γ_m(d)/m) + √(log(1/δ)/m))    (2)

where Γ_m(d) is the combinatorial quantity defined in [3], which grows as Γ_m(d) = d log m · (1 + o(1)) as m → ∞. This implies R(ERM(S^(m))) → 0 in probability as m → ∞.

Bound (2) suggests that the difficulty of the learning problem is characterized by (i) the identifiability level η, which mainly depends on the nature of the indirect supervision signal and the learner's prior information about the transition, and will be further studied in the next section; and (ii) the weak VC-major dimension d of ℓ_O ∘ 𝒯 ∘ ℋ, which depends on the modeling choices. We present the following result, which bounds d by the Natarajan dimension of ℋ and the weak VC-major dimension of the class

    ℓ_O ∘ 𝒯 := {(x, ŷ, o) ↦ ℓ_O(ŷ, T, (x, o)) : T ∈ 𝒯}

Proposition 4.3. Suppose the Natarajan dimension of ℋ is d_ℋ < ∞ and the weak VC-major dimension of ℓ_O ∘ 𝒯 is d_𝒯 < ∞. Then the weak VC-major dimension d of ℓ_O ∘ 𝒯 ∘ ℋ can be bounded by

    d ≤ 2((d_ℋ + d_𝒯) log(6(d_ℋ + d_𝒯)) + 2 d_ℋ log c), where c = |𝒴|.

The reason that we do not study the complexity of 𝒯 separately is that the annotation loss may be independent of T (i.e., ℓ_O(ŷ, T_1, (x, o)) = ℓ_O(ŷ, T_2, (x, o)) for any T_1, T_2 ∈ 𝒯); see Proposition 5.5 for an example of such a loss. To show applications of Proposition 4.3, we study the following cases.

Example 4.4. In each of the following cases, we first compute or bound d_𝒯; then d can be bounded in terms of d_ℋ:

1. When the true transition is known, or when the annotation loss function only depends on (ŷ, o), we have d_𝒯 = 0; hence d ≤ 2d_ℋ(log(6d_ℋ) + 2 log c). This is conceptually similar to Lemma 3.4 in [22], which bounds the VC-dimension of the induced hypothesis class for the noise-free superset problem.

2. When all transition hypotheses in 𝒯 are instance-independent and the annotation loss only depends on (T, ŷ, o) (e.g., the cross-entropy loss defined in (1)), then d_𝒯 can be trivially bounded by d_𝒯 ≤ cs = |𝒴 × 𝒪|; hence d ≤ 2((d_ℋ + cs) log(6(d_ℋ + cs)) + 2d_ℋ log c).

3. Suppose the instance is embedded in a vector space 𝒳 = R^p. Consider the problem (Example 5.1.3 in [24]) of binary classification with a uniform noise rate that is modeled by logistic regression: P(O = y | x, y) = σ(w^T x), where σ is the sigmoid function and w is the parameter. Then the cross-entropy loss becomes −1{o = ŷ} log(σ(w^T x)) − 1{o ≠ ŷ} log(1 − σ(w^T x)); a small numerical sketch of this loss follows below. We have d_𝒯 ≤ 2p + 2. See the supplementary material for a proof.
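The loss in case 3 of Example 4.4 is easy to write down explicitly. The sketch below evaluates it for a hypothetical weight vector w; here the transition hypothesis is parameterized by w rather than given as an explicit matrix, so the transition class is {T_w : w ∈ R^p}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_noise_annotation_loss(y_hat, o, x, w):
    """Cross-entropy annotation loss for case 3 of Example 4.4: the probability
    that the observed binary label matches the true one is modeled as
    P(O = y | x, y) = sigmoid(w^T x)."""
    keep_prob = sigmoid(w @ x)
    if o == y_hat:                      # induced prediction agrees with the observation
        return -np.log(keep_prob)
    return -np.log(1.0 - keep_prob)

# Hypothetical instance, prediction, observation and parameter (p = 2).
x = np.array([0.5, -1.0])
w = np.array([2.0, 0.3])
print(logistic_noise_annotation_loss(y_hat=1, o=1, x=x, w=w))
print(logistic_noise_annotation_loss(y_hat=1, o=0, x=x, w=w))
```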
5 Separation

Throughout this section we assume [C1] of Theorem 4.2 holds. We will first propose a concept called separation, which provides an intuitive way to understand learnability and helps to verify [C2] and [C3]; we then study two ways to ensure separation, and their application to real problems.

5.1 Learning by Separation

Without any prior knowledge, the transition class contains all possible transitions. In this case, learnability cannot be ensured, since a wrong label ŷ can also induce a good prediction of O via an incorrect transition hypothesis. Hence, some kind of prior knowledge is needed to restrict the range of 𝒯. To formalize this idea, we first introduce an extension of the KL-divergence.

Definition 5.1 (KL-divergence between Two Sets of Distributions). Given two sets of distributions 𝒟_1 and 𝒟_2, we define the KL-divergence between them as

    KL(𝒟_1 ∥ 𝒟_2) := inf_{D_1∈𝒟_1, D_2∈𝒟_2} KL(D_1 ∥ D_2)

Now we are able to state the main result of this section.

Theorem 5.2 (Separation). For all x ∈ 𝒳, denote the distribution family induced by label y_i as 𝒟_i(x) := {(T(x))_i : T ∈ 𝒯} ⊆ 𝒟_O (recall that (T(x))_i is the i-th row of T(x)), and the set of all possible predictions of the label as ℋ(x) := {h(x) : h ∈ ℋ} ⊆ 𝒴. Suppose

    γ := inf_{(x,i,j): p(x,y_i)>0, j≠i, y_j∈ℋ(x)} KL(𝒟_i(x) ∥ 𝒟_j(x)) > 0    (3)

Then ℋ is learnable from the observations of (X, O), with η ≥ γ > 0, via the ERM of the cross-entropy loss (1). We call γ the separation degree. Moreover, if (3) is not satisfied, then there exists a sequence of transitions {T^(k)}_k (T^(k) ∈ 𝒯) and distributions {𝒟_X^(k)}_k over 𝒳 such that lim_{k→∞} η^(k) = 0, where η^(k) is defined in the same way as η in [C3], with the expectation (in the definition of the risk functions) taken according to T^(k) and 𝒟_X^(k).

Figure 2: (a) Illustration of separation. A (predicted) label y_i induces a distribution family over 𝒪, denoted 𝒟_i(x). Different families are separated by a minimal distance γ. (b) Illustration of joint supervision. By adding new supervision signals, separation of particular pairs of labels is preserved.

Theorem 5.2 is important in two ways: (i) It provides a way to characterize the learner's prior knowledge about the transition using the KL-divergence and reveals its connection with the identifiability of labels. (ii) The "moreover" part shows that if separation is not satisfied, then the distributions of O induced by different labels can be arbitrarily close, and hence learning Y from O can be arbitrarily difficult. An illustration of separation is shown in Figure 2(a). A drawback of the cross-entropy loss used in Theorem 5.2 is that it can be unbounded when there is a zero element in the transition matrix; this problem will be partly solved in Proposition 5.5 by introducing a different annotation loss.

As the simplest application, we consider the case where the transition is fully known to the learner. In this case, each induced distribution family 𝒟_i(x) reduces to a single point in the probability simplex.

Example 5.3 (Full Information of the Transition). Suppose the transition T(x) is known to the learner, i.e., 𝒯 = {T_0}. By Theorem 5.2, ℋ is learnable if

    inf_{(x,i,j): p(x)>0, i≠j, y_j∈ℋ(x)} KL((T_0(x))_i ∥ (T_0(x))_j) > 0

Notice that this is a weaker assumption than the invertibility assumption on T_0(x) used in [36] (called reconstructible corruption there). This is possible because we assume a deterministic rule for X → Y but a randomized process for (X, Y) → O; hence the latter can contain more information and is capable of encoding the deterministic rule even when 𝒪 is smaller than 𝒴. For a concrete example, consider the following constant transition matrix:

    T_0 = [ 0.1  0.9
            0.5  0.5
            0.9  0.1 ]

In this case, |𝒴| = 3 > |𝒪| = 2 and therefore T_0 is not right-invertible. However, since the distribution of O induced by each label y ∈ 𝒴 is known to the learner, and this distribution can be estimated from observations of O, the learner is able to recover the true label from the indirect supervision.
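For the constant transition above, the separation condition of Example 5.3 can be checked directly: with 𝒯 = {T_0}, condition (3) reduces to the minimum pairwise KL-divergence between the rows of T_0. A small sketch of that computation, treating all labels as possible predictions (an assumption made only for this illustration):

```python
import numpy as np

def kl(p, q):
    """KL-divergence between two discrete distributions over O."""
    return float(np.sum(p * np.log(p / q)))

# Rows are the known distributions (T0)_i of O given each label y_i.
T0 = np.array([[0.1, 0.9],
               [0.5, 0.5],
               [0.9, 0.1]])

# With a fully known transition, each family D_i(x) is the single row i,
# so the separation degree is the minimum KL over distinct label pairs.
gamma = min(kl(T0[i], T0[j])
            for i in range(3) for j in range(3) if i != j)
print(gamma)   # positive, so the labels are separated although T0 is not invertible
```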
5.2 Separation by Total Variation

In this subsection, we introduce a way to guarantee separation by controlling the KL-divergence through the total variation distance, via the well-known Pinsker's inequality [35].

Lemma 5.4 (Pinsker's inequality, proposed in [28]; see [35] for an introduction). If P and Q are two probability distributions on the same measurable space (Ω, F), then

    ∥P − Q∥_TV ≤ √(KL(P ∥ Q) / 2)

where ∥·∥_TV is the total variation distance, ∥P − Q∥_TV := sup_{A∈F} |P(A) − Q(A)|. Moreover, if Ω is countable (in our case, Ω = 𝒪 is finite), the total variation distance is equivalent to the L1-distance in the sense that ∥P − Q∥_TV = (1/2) ∥P − Q∥_1.

This lemma suggests that we can ensure separation by controlling the L1-distance. As a concrete example, we introduce the concentration condition. The intuition behind it is that the information about different labels in 𝒴 is concentrated on relatively different sets of annotations. Formally:

Proposition 5.5 (Concentration). A sufficient condition for (3) is that for every 1 ≤ i ≤ c there exists a set S_i ⊆ 𝒪 (we call these concentration sets) such that

    γ_C := inf_{(i,j,x,T): T∈𝒯, p(x)>0, j≠i} [P_T(O ∈ S_i | x, y_i) − P_T(O ∈ S_j | x, y_i)] > 0    (4)

where P_T(·) is the conditional probability defined by the transition T. Under this condition, we can relate the identifiability and separation degrees by η ≥ γ ≥ 2γ_C².

Since a condition imposed on all T ∈ 𝒯 can be regarded as an assumption imposed on the true transition T_0, condition (4) can be rewritten as

    γ_C = inf_{(i,j,x): p(x,y_i)>0, i≠j} [P(O ∈ S_i | x, y_i) − P(O ∈ S_j | x, y_i)] > 0    (5)

Also, in this case, one can ensure learnability via the ERM that minimizes the following transition-independent annotation loss:

    ℓ_O(h(x), T, (x, o)) := 1{o ∉ S_{h(x)}}    (6)

For this annotation loss, we can bound the identifiability level by η ≥ γ_C.

Example 5.6 (Superset with Noise). For the superset-with-noise problem, where O is a random subset of 𝒴 (i.e., 𝒪 = 2^𝒴), let S_i = {o : y_i ∈ o} ⊆ 𝒪. Condition (5) becomes

    γ_C = inf_{p(x,y_i)>0, i≠j} [P(y_i ∈ O | x, y_i) − P(y_j ∈ O | x, y_i)] > 0    (7)

This generalizes the small ambiguity degree condition proposed in [16, 22], which assumes P(y_i ∈ O | x, y_i) = 1 (i.e., the gold label always lies in the superset). [16, 22] also propose a superset loss, which is a special case of (6). We extend the discussion to allow the presence of noise. The following example can be regarded as a special case of Example 5.6.

Example 5.7 (Label Noise). For the noisy label problem, where 𝒪 = 𝒴, let S_i = {y_i}. Condition (4) becomes

    γ_C = inf_{p(x,y_i)>0, i≠j} [P(O = y_i | x, y_i) − P(O = y_j | x, y_i)] > 0    (8)

This generalizes the Massart noise condition [23] for binary classification, which assumes the noise rate is upper bounded by 1/2 minus a constant; we extend the discussion to the multiclass case. Also, notice that (6) is simply the zero-one loss for O, which means learnability can still be guaranteed if one ignores the noise process and learns O as if it were the clean label. This partly explains the empirical study in [31], which tests the robustness of neural networks (without an additional denoising process) to noise in annotations. [31] proposes a parameter called the δ-degree, which is similar to γ_C, and observes that the performance of the network decreases as δ decreases, as our learning bound (2) suggests.
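Conditions (6) and (8) are easy to check numerically for an instance-independent noisy-label transition. The sketch below computes γ_C of Example 5.7 from a hypothetical transition matrix and implements the transition-independent loss (6) with the concentration sets S_i = {y_i}:

```python
import numpy as np

# Hypothetical instance-independent noisy-label transition over Y = O = {0, 1, 2}:
# row i is P(O = . | Y = y_i).
T0 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.2, 0.2, 0.6]])

# gamma_C of Example 5.7 with S_i = {y_i}:
# minimum over i != j of P(O = y_i | y_i) - P(O = y_j | y_i).
gamma_C = min(T0[i, i] - T0[i, j]
              for i in range(3) for j in range(3) if i != j)
print(gamma_C)   # 0.4 here; positive, so the labels are separated

def concentration_loss(y_hat, o):
    """Transition-independent annotation loss (6): penalize the prediction
    whenever the observed annotation falls outside S_{y_hat} = {y_hat}."""
    return float(o != y_hat)

print(concentration_loss(y_hat=2, o=2), concentration_loss(y_hat=2, o=0))
```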
We can further generalize Proposition 5.5 by encoding functional prior information about the transition:

Proposition 5.8 (Evidence). A sufficient condition for (3) is that there exist functions Φ_ij : R^s → R, 1 ≤ i ≠ j ≤ c (we call them evidence), Lipschitz with respect to the L1-norm of vectors with Lipschitz constants L_ij, such that

    γ_ij := inf_{p(x,y_i)>0, y_j∈ℋ(x), D_i∈𝒟_i(x), D_j∈𝒟_j(x)} [Φ_ij(D_i) − Φ_ij(D_j)] > 0    (9)

In this case, the separation degree can be bounded by γ ≥ (1/2) min_{i≠j} (γ_ij / L_ij)².

In particular, the dot product with a fixed vector, Φ_u(t) = ⟨u, t⟩ (where ⟨·,·⟩ denotes the dot product), is Lipschitz with Lipschitz constant L_u ≤ ∥u∥_∞. As an example, given sets S_i ⊆ 𝒪 (1 ≤ i ≤ c), letting Φ_ij(·) = ⟨Σ_{k: o_k∈S_i} ê_k − Σ_{k: o_k∈S_j} ê_k, ·⟩, where ê_k is the k-th standard unit basis vector of R^s, recovers Proposition 5.5. Another example will be given in Example 5.13.

5.3 Separation by Joint Supervision

When a weak supervision signal cannot ensure learnability on its own, it needs to be used together with other forms of annotation to supervise the learning. Our goal in this subsection is to provide a way to describe the effect of using multiple sources of annotations jointly. We will show that joint supervision can improve (Example 5.13), preserve (Proposition 5.10) or even damage (Remark 5.11) separation.

First, we formulate the joint supervision problem. For simplicity, we only consider the case of two sources of annotations O_1, O_2; the general case can be discussed in a similar way. For each O_k, k ∈ {1, 2}, denote its annotation space as 𝒪_k, its transition as T_k(x) and its transition class as 𝒯_k. We focus on the scenario in which each instance x receives only one type of annotation. The joint annotation space is then 𝒪 = 𝒪_1 ∪ 𝒪_2. We model the annotation type 1{O = O_k} as a random variable that is independent of X and of both O_k, and the probability P(O = O_1) = λ is known to the learner. The joint annotation is then defined as O = 1{O = O_1} O_1 + 1{O = O_2} O_2.

Next, we quantify the supervision power of an annotation when separation is not guaranteed, via a local (pairwise) version of the separation degree:

Definition 5.9 (Pairwise Separation). Define the separation degree of y_i from y_j as

    γ_{i→j} := inf_{x: p(x,y_i)>0, y_j∈ℋ(x)} KL(𝒟_i(x) ∥ 𝒟_j(x))    (10)

We say that y_i is separated from y_j if γ_{i→j} > 0. The separation degree is γ = min_{i≠j} γ_{i→j}.

This definition gives a probabilistic formulation of the intuition that a (weak) supervision signal can help distinguish certain pairs of labels. For example, a noisy annotation for multiclass classification may violate condition (8) due to a large noise rate for certain labels, but it can still provide information to separate other labels if (8) is satisfied for the other pairs (i, j).
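Pairwise separation is easy to illustrate with a known, instance-independent transition: if an annotator systematically confuses two labels, only that pair loses separation. A sketch with a hypothetical confusion pattern, in which labels y_2 and y_3 are indistinguishable to the annotator:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical noisy annotator over Y = O = {y1, y2, y3} that merges y2 and y3:
# rows 2 and 3 induce the same distribution over O.
T0 = np.array([[0.9, 0.05, 0.05],
               [0.1, 0.45, 0.45],
               [0.1, 0.45, 0.45]])

# With T = {T0}, the pairwise separation degree (10) is the KL between rows.
for i in range(3):
    for j in range(3):
        if i != j:
            print(f"gamma_{i+1}->{j+1} = {kl(T0[i], T0[j]):.3f}")
# gamma_2->3 and gamma_3->2 are 0: this signal alone cannot separate y2 from y3,
# but it still separates both of them from y1 (those degrees are positive).
```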
When there are no additional constraints on the joint transition, one can construct the joint transition simply by combining the candidate transitions in 𝒯_1 and 𝒯_2. For example, the distribution family induced by y_i under the joint supervision can be naturally constructed as

    𝒟_i(x) = {λ D_1 + (1 − λ) D_2 : D_1 ∈ 𝒟_{i1}(x), D_2 ∈ 𝒟_{i2}(x)}    (11)

where 𝒟_{i1}(x) and 𝒟_{i2}(x) are the distribution families induced by y_i under O_1 and O_2. In this case, we present the following result to characterize learnability under the joint supervision O.

Proposition 5.10 (No Free Separation). Suppose the separation degrees of y_i from y_j under O_1 and O_2 are γ_{i→j,1} and γ_{i→j,2} respectively. If the joint transition class is constructed as in (11), then the separation degree of y_i from y_j for the joint supervision satisfies

    γ_{i→j} ≤ λ γ_{i→j,1} + (1 − λ) γ_{i→j,2}

Moreover, if 𝒪_1 ∩ 𝒪_2 = ∅, then equality holds. As a consequence, a necessary condition for y_i to be separated from y_j by the joint signal O is that y_i is separated from y_j by at least one of O_1, O_2.

Remark 5.11 (Defining 𝒪_1 ∩ 𝒪_2). The condition 𝒪_1 ∩ 𝒪_2 = ∅ means that the learner distinguishes annotations from different sources. For example, in a crowdsourcing setting with two annotators, each providing a noisy annotation, we have 𝒪_1 = 𝒪_2 = 𝒴. But as long as the learner distinguishes the annotations of the two annotators, we can nevertheless treat 𝒪_1 ∩ 𝒪_2 = ∅. Without this condition, even if both γ_{i→j,1}, γ_{i→j,2} > 0, we can still have γ_{i→j} = 0; see the supplementary material for an example. This idea has also been explored in the empirical study of [18], which observes that in a crowdsourcing setting, model performance improves if annotator identifiers are provided as input features. However, one should note that the tradeoff is model complexity: distinguishing different annotations will in general require more parameters to model the joint transition.

Remark 5.12. Proposition 5.10 shows that, without constraints, joint supervision does not create new separation; however, it can preserve the separation between labels provided by the original supervision signals. In this view, a weak supervision signal can be regarded as a building block for the (global) separation (3) by contributing pairwise separation (10). An illustration is shown in Figure 2(b).

If there do exist constraints relating the two transition classes, Proposition 5.10 no longer holds and joint supervision may create new separation. To illustrate, consider the following artificial example.

Example 5.13 (Learning from Difference). Consider a binary classification problem with 𝒴 = {−1, +1}. Suppose we have two annotators O_1 and O_2, each providing a noisy annotation with an unknown, uniform, instance-independent noise rate, i.e.,

    η_1 := P(O_1 ≠ Y | x, Y = −1) = P(O_1 ≠ Y | x, Y = +1),
    η_2 := P(O_2 ≠ Y | x, Y = −1) = P(O_2 ≠ Y | x, Y = +1),

where η_1, η_2 are constants independent of x. The joint transition is then modeled as

    T = [ λ(1−η_1)   λη_1        (1−λ)(1−η_2)   (1−λ)η_2
          λη_1       λ(1−η_1)    (1−λ)η_2       (1−λ)(1−η_2) ]

where the rows correspond to the true labels −1 and +1, and the columns to the annotations −1 and +1 from annotator 1 followed by −1 and +1 from annotator 2. Now, suppose it is known that the first annotator provides annotations of better quality, i.e., there is a γ ∈ R (known to the learner) such that η_1 − η_2 ≤ γ < 0. To apply Proposition 5.8, define the evidence Φ(·) = ⟨ê_1/λ − ê_3/(1 − λ), ·⟩. Then for any D_1 ∈ 𝒟_1(x) and D_2 ∈ 𝒟_2(x), we have Φ(D_1) = η_2 − η_1 ≥ −γ and Φ(D_2) = η_1 − η_2 ≤ γ. So, by Proposition 5.8, the original classification hypothesis class is learnable. Notice that without joint supervision, separation is not guaranteed, since we have no restriction on η_1 or η_2 individually. This example shows the necessity of modeling possible constraints between different supervision sources, which help to reduce the size of the joint transition class and may improve the separation degree.
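Example 5.13 can be checked numerically: for any noise rates satisfying the constraint η_1 − η_2 ≤ γ < 0, the evidence Φ evaluated on the two induced rows keeps a gap of at least −2γ. A sketch with hypothetical values of λ, η_1 and η_2:

```python
import numpy as np

lam, eta1, eta2 = 0.5, 0.1, 0.3     # hypothetical values; eta1 - eta2 <= gamma < 0
gamma = -0.15                        # known bound; here eta1 - eta2 = -0.2 <= gamma

# Joint transition of Example 5.13.
# Columns: (annotator 1, -1), (annotator 1, +1), (annotator 2, -1), (annotator 2, +1).
T = np.array([[lam*(1-eta1), lam*eta1,     (1-lam)*(1-eta2), (1-lam)*eta2],
              [lam*eta1,     lam*(1-eta1), (1-lam)*eta2,     (1-lam)*(1-eta2)]])

# Evidence Phi(.) = <e1/lam - e3/(1-lam), .> from Proposition 5.8.
u = np.array([1/lam, 0.0, -1/(1-lam), 0.0])
phi = T @ u
print(phi)                               # [eta2 - eta1, eta1 - eta2] = [0.2, -0.2]
print(phi[0] - phi[1], ">=", -2*gamma)   # gap 0.4 >= 0.3, so condition (9) holds
```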
6 Conclusion and Future Work

In this paper, we provide a unified framework for analyzing the learnability of multiclass classification with indirect supervision. Our theory builds upon two key components: (i) the construction of the induced hypothesis class and its complexity analysis, which allows us to indirectly supervise the learning by minimizing the annotation risk; and (ii) a formal description of the learner's prior knowledge about the transition and its encoding in the learning condition, which allows us to bound the classification error by the annotation risk.

The notion of separation depends on the annotation loss being used. The KL-divergence may be replaced by other statistical distances, as long as the distance can induce a loss function. However, the idea behind separation is invariant: the prior knowledge needs to be strong enough to distinguish different labels via the observables. Moreover, Theorem 5.2 shows that separation is a sufficient and almost necessary condition, and the later examples show that separation is also practically useful and can easily produce learnability conditions. Therefore, we believe the concepts introduced here are general, and that our analysis tools can be applied in many other supervision scenarios.

One limitation of our work is that the definition of learnability requires us to handle every possible 𝒟_X, and the consequence is that we need to ensure separation at every x ∈ 𝒳. In future work, we may try to relax the learnability conditions by encoding prior knowledge of 𝒟_X, which can be obtained from unlabeled datasets. Another direction to explore is to extend the discussion to the agnostic case as well as the case where T_0 ∉ 𝒯. Also, in a directly supervised learning setting, classical realizable PAC learning can achieve a convergence rate of O(log(1/δ)/m), which is better than the rate of O(√(log(1/δ)/m)) we derived for a general indirect supervision problem. It is worth exploring whether and how our bounds could be improved for more kinds of supervision signals (other than the gold label).

Broader Impact

Our work mostly focuses on theoretical aspects of learning; however, it provides better understanding and thus can suggest new machine learning scenarios and algorithms for learning from indirect observations. This addresses a key challenge to machine learning today, and will help machine learning researchers reduce the cost of and need for labeled data. Our theory may have positive and negative impacts on the privacy protection of sensitive data. On one hand, the theory suggests that one can alter the form of data (via a probabilistic transition) to ensure privacy while keeping its usefulness (learnability). On the other hand, it might be possible for an attacker to recover sensitive information about the data indirectly through a related dataset.

Acknowledgments and Disclosure of Funding

This work was supported by the Army Research Office under Grant Number W911NF-20-1-0080 and by contract FA8750-19-2-0201 with the US Defense Advanced Research Projects Agency (DARPA). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

References

[1] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 1988.

[2] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient Learning of Linear Separators under Bounded Noise. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 167–190. PMLR, 2015. URL http://proceedings.mlr.press/v40/Awasthi15b.html.

[3] Yannick Baraud. Bounding the expectation of the supremum of an empirical process over a (weak) VC-major class. Electronic Journal of Statistics, 10(2):1709–1728, 2016. doi: 10.1214/15-EJS1055. URL https://doi.org/10.1214/15-EJS1055.
[4] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3:463–482, 2003.

[5] S. Ben-David and Reba Schuller Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73:273–287, 2007.

[6] Shai Ben-David and Reba Schuller Borbely. Exploiting Task Relatedness for Multiple Task Learning. In Proc. of the ACM Conference on Computational Learning Theory (COLT), 2003.

[7] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. ISBN 9781107057135.

[8] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

[9] Jakramate Bootkrajang and Ata Kabán. Label-Noise Robust Logistic Regression and Its Applications. In Machine Learning and Knowledge Discovery in Databases, pages 143–158. Springer, 2012.

[10] Ming-Wei Chang, Dan Goldwasser, Dan Roth, and Vivek Srikumar. Discriminative Learning over Constrained Latent Representations. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2010. URL http://cogcomp.org/papers/CGRS10.pdf.

[11] Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser, and Dan Roth. Structured Output Learning with Indirect Supervision. In Proc. of the International Conference on Machine Learning (ICML), 2010. URL http://cogcomp.org/papers/CSGR10.pdf.

[12] Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with Bounded Instance- and Label-dependent Label Noise. 2017.

[13] Jesús Cid-Sueiro. Proper Losses for Learning from Partial Labels. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), pages 1565–1573, 2012.

[14] Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of Losses for Learning from Weak Labels. In Machine Learning and Knowledge Discovery in Databases, pages 197–210. Springer, 2014.

[15] James Clarke, Dan Goldwasser, Ming-Wei Chang, and Dan Roth. Driving Semantic Parsing from the World's Response. In Proc. of the Conference on Computational Natural Language Learning (CoNLL), 2010. URL http://cogcomp.org/papers/CGCR10.pdf.

[16] Timothée Cour, Benjamin Sapp, and Ben Taskar. Learning from partial labels. Journal of Machine Learning Research, 12:1501–1536, 2011.

[17] Ilias Diakonikolas, Themis Gouleakis, and Christos Tzamos. Distribution-Independent PAC Learning of Halfspaces with Massart Noise. In NeurIPS, 2019.

[18] Mor Geva, Yoav Goldberg, and Jonathan Berant. Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets. In EMNLP-IJCNLP, 2019.

[19] Aritra Ghosh, Naresh Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015. URL http://www.sciencedirect.com/science/article/pii/S0925231215001204.

[20] T. Ishida, Gang Niu, Weihua Hu, and Masashi Sugiyama. Learning from Complementary Labels. In NIPS, 2017.

[21] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 1998.

[22] Li-Ping Liu and Thomas G. Dietterich. Learnability of the Superset Label Learning Problem. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages II-1629–II-1637. JMLR.org, 2014. URL http://dl.acm.org/citation.cfm?id=3044805.3045074.

[23] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. Annals of Statistics, 34(5):2326–2366, 2006. URL https://doi.org/10.1214/009053606000000786.

[24] Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent noise. Machine Learning, 107(8):1561–1595, 2018. URL https://doi.org/10.1007/s10994-018-5715-3.

[25] B. K. Natarajan. On learning sets and functions. Machine Learning, 1989.

[26] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with Noisy Labels. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 1196–1204, 2013. URL http://dl.acm.org/citation.cfm?id=2999611.2999745.

[27] Qiang Ning, Hangfeng He, Chuchu Fan, and Dan Roth. Partial or Complete, That's The Question. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019. URL https://arxiv.org/pdf/1906.04937.pdf.

[28] M. S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day Series in Time Series Analysis. Holden-Day.

[29] Aditi Raghunathan, Roy Frostig, John Duchi, and Percy Liang. Estimation from Indirect Supervision with Linear Moments. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2568–2577. PMLR, 2016. URL http://proceedings.mlr.press/v48/raghunathan16.html.

[30] L. Ralaivola, F. Denis, and C. Magnan. CN = CPCN. In ICML, 2006.

[31] David Rolnick, Andreas Veit, Serge Belongie, and Nir Shavit. Deep Learning is Robust to Massive Label Noise. 2017.

[32] Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with Asymmetric Label Noise: Consistency and Maximal Denoising. In Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 489–511. PMLR, 2013. URL http://proceedings.mlr.press/v30/Scott13.html.

[33] Jacob Steinhardt and Percy Liang. Learning with Relaxed Supervision. In NIPS, 2015.

[34] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir D. Bourdev, and Rob Fergus. Training Convolutional Networks with Noisy Labels. arXiv preprint, 2014.

[35] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2008. ISBN 9780387790527.

[36] Brendan van Rooyen and Robert C. Williamson. A Theory of Learning with Corrupted Labels. Journal of Machine Learning Research, 18(228):1–50, 2018. URL http://jmlr.org/papers/v18/16-315.html.