Estimation from Indirect Supervision with Linear Moments

Aditi Raghunathan (ADITIR@STANFORD.EDU), Roy Frostig (RF@CS.STANFORD.EDU), John Duchi (JDUCHI@STANFORD.EDU), Percy Liang (PLIANG@CS.STANFORD.EDU)
Stanford University, Stanford, CA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

In structured prediction problems where we have indirect supervision of the output, maximum marginal likelihood faces two computational obstacles: non-convexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to estimate the sufficient statistics of the model, which we then use to estimate parameters via convex optimization. We analyze the statistical properties of our approach and show empirically that it is effective in two settings: learning with local privacy constraints and learning from low-cost count-based annotations.

1. Introduction

Consider the problem of estimating a probabilistic graphical model from indirect supervision, where only a partial view of the variables is available. We are interested in indirect supervision for two reasons. First, one might not trust a data collector and wish to use privacy mechanisms to reveal only partial information about sensitive data (Warner, 1965; Evfimievski et al., 2004; Dwork et al., 2006; Duchi et al., 2013). Second, if data is generated by human annotators, say in a crowdsourcing setting, it can often be more cost-effective to solicit lightweight annotations (Maron & Lozano-Pérez, 1998; Mann & McCallum, 2008; Quadrianto et al., 2008; Liang et al., 2009). In both cases, we trade statistical efficiency for another resource: privacy or annotator cost.

Indirect supervision is naturally handled by defining a latent-variable model in which the structure of interest is treated as a latent variable. While statistically sensible, learning latent-variable models poses two computational challenges. First, maximum marginal likelihood requires non-convex optimization, where popular procedures such as gradient descent or Expectation Maximization (EM) (Dempster et al., 1977) are only guaranteed to converge to local optima. Second, even computing the gradient or performing the E-step can be intractable, as it requires probabilistic inference in the loopy graphical model induced by the indirect supervision (Chang et al., 2007; Graça et al., 2008; Liang et al., 2009).

In this paper, we propose an approach that bypasses both computational obstacles for a class we call linear indirectly-supervised learning problems. We lean on the method of moments (Pearson, 1894), which has recently led to advances in learning latent-variable models (Hsu et al., 2009; Anandkumar et al., 2013; Chaganty & Liang, 2014), although we do not appeal to tensor factorization. Instead, we express the indirect supervision as a linear combination of the sufficient statistics of the model, which we recover by solving a simple noisy linear system. Once we have the sufficient statistics, we use convex optimization to solve for the model parameters. The key is that while supervision per example is indirect and leads to intractability, aggregation over a large number of examples renders the problem tractable.
While our moments-based estimator yields computational benefits, we suffer some statistical loss relative to maximum marginal likelihood. In Section 5, we compare the asymptotic variance of the marginal-likelihood and moment-based estimators, and we provide some geometric insight into their differences in Section 6. Finally, in Section 7, we apply our framework empirically to our two motivating settings: (i) learning a regression model under local privacy constraints, and (ii) learning a part-of-speech tagger with lightweight annotations. In both applications, we show that our moments-based estimator obtains good accuracies.

2. Setup

Figure 1. We solve a structured prediction problem from x to y. During training, we observe not y, but indirect supervision o.

Notation. We use superscripts to enumerate instances in a data sample (e.g., $x^{(1)}, \ldots, x^{(n)}$), and square-bracket indexing to enumerate components of a vector or sequence: $x[b]$ denotes the component(s) of x associated with b. For a real vector $x \in \mathbb{R}^d$, we let $x^{\otimes 2} \stackrel{\mathrm{def}}{=} x x^\top$.

Model. Consider the structured prediction task of mapping an input $x \in \mathcal{X}$ to some output $y \in \mathcal{Y}$. We model this mapping using a conditional exponential family

$p_\theta(y \mid x) = \exp\{\phi(x, y)^\top \theta - A(\theta; x)\}$, (1)

where $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ is the feature mapping, $\theta \in \mathbb{R}^d$ is the parameter vector, and $A(\theta; x) = \log \sum_{y \in \mathcal{Y}} \exp\{\phi(x, y)^\top \theta\}$ is the log-partition function. For concreteness, we specialize to conditional random fields (CRFs) (Lafferty et al., 2001) over collections of variables with K-valued labels, where $x = (x[1], \ldots, x[L])$ and $y = (y[1], \ldots, y[L]) \in [K]^L$; here L is the number of variables and K is the number of possible labels per variable. We let $\mathcal{C} \subseteq 2^{[L]}$ be the set of cliques in the CRF, so that the features decompose into a sum over cliques: $\phi(x, y) = \sum_{c \in \mathcal{C}} \phi_c(y[c], x, c)$. As one particular example, if $\mathcal{C}$ consists of all nodes $\{\{1\}, \ldots, \{L\}\}$ and edges between adjacent nodes $\{\{1, 2\}, \ldots, \{L-1, L\}\}$, the CRF is chain-structured.

Learning from indirect supervision. In the indirectly supervised setting that is the focus of this paper, we do not have access to y but rather only to observations $o \in \mathcal{O}$, where o is drawn from a known supervision distribution $S(o \mid y)$. For each $i = 1, \ldots, n$, let $(x^{(i)}, y^{(i)})$ be drawn from some unknown data-generating distribution $p^*$ (by default, we do not assume the CRF is well-specified), and let $o^{(i)}$ be drawn according to $S(\cdot \mid y^{(i)})$, as in Figure 1. The learning problem is then the natural one: given the training examples $(x^{(1)}, o^{(1)}), \ldots, (x^{(n)}, o^{(n)})$, we wish to produce an estimate $\hat\theta$ for the model (1).

Maximum marginal likelihood. The classic paradigm is to maximize the marginal likelihood:

$\hat\theta_{\text{marg}} \stackrel{\mathrm{def}}{=} \arg\max_\theta \hat{\mathbb{E}}\Big[\log \sum_{y \in \mathcal{Y}} S(o \mid y)\, p_\theta(y \mid x)\Big]$, (2)

where $\hat{\mathbb{E}}$ denotes an expectation over the training sample. While $\hat\theta_{\text{marg}}$ is often statistically efficient, there are two computational difficulties associated with this approach:

1. The log-likelihood objective (2) is typically non-convex, so computing $\hat\theta_{\text{marg}}$ exactly is in general intractable (see Section 6 for a more detailed discussion). Local algorithms like Expectation Maximization (Dempster et al., 1977) are only guaranteed to converge to local optima.

2. Computing the gradient or performing the E-step requires computing $p(y \mid x, o) \propto p_\theta(y \mid x)\, S(o \mid y)$, which is intractable, not due to the model $p_\theta$, but due to the supervision S. This motivates a number of relaxations (Graça et al., 2008; Liang et al., 2009; Steinhardt & Liang, 2015), but these carry no guarantees on approximation quality.

Our approach: moment-based estimation. We present a simple approach that circumvents the above issues for a restricted class of models, in the same vein as Chaganty & Liang (2014). To begin, consider the fully-supervised setting, where we observe examples $\{(x^{(i)}, y^{(i)})\}$. In this case, we could maximize the likelihood $\hat{\mathbb{E}}[\log p_\theta(y \mid x)]$ by solving $\arg\max_\theta \{\hat\mu^\top \theta - \hat{\mathbb{E}}[A(\theta; x)]\}$, where $\hat\mu \stackrel{\mathrm{def}}{=} \hat{\mathbb{E}}[\phi(x, y)] \in \mathbb{R}^d$ are the sufficient statistics, which converge to $\mu^* \stackrel{\mathrm{def}}{=} \mathbb{E}[\phi(x, y)]$. Therefore, if we could construct a consistent estimate $\hat\mu$ of $\mu^*$, then we could solve the same convex optimization problem used in the fully-supervised estimator. Of course, we do not have access to $\hat{\mathbb{E}}[\phi(x, y)]$. Instead, in our (linearly) indirectly supervised setting, we are able to define an observation function $\beta(x, o) \in \mathbb{R}^d$ which is nonetheless in expectation equal to the population sufficient statistics:

$\mathbb{E}[\beta(x, o) \mid x, y] = \phi(x, y)$, so that $\mathbb{E}[\beta(x, o)] = \mu^*$. (3)

In general, we construct $\beta(x, o)$ by solving a linear system.

Figure 2. Our approach: (i) solve a linear system based on the data $\{(x^{(i)}, o^{(i)})\}$ to estimate the sufficient statistics $\hat\mu$, then (ii) use convex optimization to estimate the model parameters $\hat\theta$.

Putting the pieces together yields our estimator (Figure 2; a code sketch follows the two steps below):

1. Sufficient statistics: $\hat\mu = \hat{\mathbb{E}}[\beta(x, o)]$.
2. Parameters: $\hat\theta_{\text{mom}} \stackrel{\mathrm{def}}{=} \arg\max_\theta \{\hat\mu^\top \theta - \hat{\mathbb{E}}[A(\theta; x)]\}$.

In the next two sections, we describe the observation function $\beta(x, o)$ for learning with local privacy (Section 3) and lightweight annotations (Section 4).
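To make the two-step pipeline concrete, here is a minimal sketch (our own illustrative code, not from the paper). It assumes a small finite output space `Y` that we can enumerate directly (real CRFs would compute $A(\theta; x)$ by dynamic programming), along with hypothetical helpers `phi(x, y)` and `beta(x, o)` satisfying (3).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def fit_from_moments(beta, data, phi, Y):
    """Two-step estimator: (i) mu_hat = average of beta(x, o);
    (ii) theta_hat = argmax_theta {mu_hat . theta - E_hat[A(theta; x)]}.
    `data` is a list of (x, o) pairs; `Y` is a small, enumerable output space."""
    # Step 1: estimate the sufficient statistics by a sample average.
    mu_hat = np.mean([beta(x, o) for x, o in data], axis=0)

    xs = [x for x, _ in data]
    def neg_objective(theta):
        # A(theta; x) = logsumexp over y of phi(x, y) . theta
        A = np.mean([logsumexp([phi(x, y) @ theta for y in Y]) for x in xs])
        return -(mu_hat @ theta - A)   # concave objective, negated for `minimize`

    d = mu_hat.shape[0]
    res = minimize(neg_objective, np.zeros(d))  # convex: any local optimum is global
    return res.x
```

Step (ii) is exactly the fully-supervised convex problem with $\hat{\mathbb{E}}[\phi(x, y)]$ replaced by $\hat\mu$; all dependence on the indirect supervision is confined to step (i).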
3. Learning under local privacy

Suppose we wish to estimate a conditional distribution $p_\theta(y \mid x)$, where x is non-sensitive information about an individual and y contains sensitive information (for example, income or disease status). Individuals, for a variety of reasons (mistrust, embarrassment, fear of discrimination), may wish to keep y private and instead release some $o \sim S(\cdot \mid y)$. To quantify the amount of privacy afforded by S, we turn to the literature on privacy in databases and theoretical computer science (Evfimievski et al., 2004; Dwork et al., 2006) and say that S is $\epsilon$-differentially private if any two $y, y'$ have comparable probability (up to a factor of $\exp(\epsilon)$) of generating o:

$\frac{S(o \mid y)}{S(o \mid y')} \le \exp(\epsilon)$ for all $o, y, y'$. (4)

What S should we employ? We first explore the classical randomized response (RR) mechanism (Section 3.1), and then develop a new mechanism that leverages the graphical model structure (Section 3.2).

3.1. Classic randomized response

Warner (1965) proposed the now-classical randomized response technique, which proceeds as follows: for some fixed (generally small) $\gamma > 0$, the respondent reveals y with probability $\gamma$, and with probability $1 - \gamma$ draws a sample from a (known) base distribution u, generally uniform over $\mathcal{Y}$. Formally, the classical randomized response supervision is

$S(o \mid y) \stackrel{\mathrm{def}}{=} \gamma\, \mathbb{I}[o = y] + (1 - \gamma)\, u(o)$. (5)

Estimation. Our goal is to construct a function $\beta$ satisfying (3). Toward that end, let us start with what we can estimate and expand using (5):

$\mathbb{E}[\phi(x, o)] = \gamma\, \mathbb{E}[\phi(x, y)] + (1 - \gamma)\, \mathbb{E}[\phi(x, y')]$, (6)

where $y' \sim u$. Rearranging (6), we see that we can solve for $\mu^* = \mathbb{E}[\phi(x, y)]$. Indeed, if we define the observation function

$\beta(x, o) \stackrel{\mathrm{def}}{=} \frac{1}{\gamma}\Big(\phi(x, o) - (1 - \gamma)\, \mathbb{E}[\phi(x, y') \mid x]\Big)$, (7)

we can verify that $\mathbb{E}[\beta(x, o)] = \mu^*$.
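For intuition, here is a small simulation of classical randomized response and its debiasing function (our own sketch; it assumes an unstructured output space of size K, a uniform base distribution u, and a hypothetical feature table `phi`):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                   # |Y|: small unstructured output space
phi = np.array([[5.0, 1.0], [1.0, 2.0], [0.0, 1.0], [2.0, 0.0]])  # phi(y), one row per y
gamma = 0.3                             # probability of revealing the true y

def randomize(y):
    """Classical randomized response, eq. (5): reveal y w.p. gamma, else o ~ u."""
    return y if rng.random() < gamma else rng.integers(K)

def beta(o):
    """Observation function, eq. (7): debias so that E[beta(o)] = E[phi(y)]."""
    E_phi_u = phi.mean(axis=0)          # E[phi(y')] under the uniform base u
    return (phi[o] - (1 - gamma) * E_phi_u) / gamma

ys = rng.integers(K, size=200_000)      # stand-in for draws of y
os_ = np.array([randomize(y) for y in ys])
mu_hat = np.mean([beta(o) for o in os_], axis=0)
print(mu_hat, phi[ys].mean(axis=0))     # the two should approximately agree
```

Averaging $\beta(o)$ over the sample recovers $\hat{\mathbb{E}}[\phi(x, y)]$ without ever observing y directly.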
Privacy. We can check that the ratio satisfies

$\frac{S(o \mid y)}{S(o \mid y')} \le \frac{\gamma + (1 - \gamma) u(o)}{(1 - \gamma) u(o)} = 1 + \frac{\gamma}{(1 - \gamma) u(o)}$,

so classical randomized response is $\frac{\gamma}{(1 - \gamma)\min_o u(o)}$-differentially private. For any distribution u, we have $\min_o u(o) \le 1/|\mathcal{Y}|$, so this value is at least $\frac{\gamma}{1 - \gamma}|\mathcal{Y}|$: a linear dependence on $|\mathcal{Y}|$. In classical randomized response settings, $|\mathcal{Y}| = 2$, which is unproblematic. In contrast, in structured prediction settings, the number of labels is exponential in the number of variables ($|\mathcal{Y}| = K^L$), so to guarantee $\epsilon$-differential privacy we must take $\gamma = O(\epsilon K^{-L})$. But the asymptotic variance of $\hat\theta_{\text{mom}}$ scales as $\gamma^{-2}$ (as will be shown in Section 5), which makes classical randomized response unusable for structured prediction.

3.2. Structured randomized response

With this difficulty in mind, we recognize that we must somehow leverage the structure of the sufficient statistics vector $\phi(x, y)$ to perform estimation. In particular, we show that the supervision need only depend on the sufficient statistics:

Proposition 1. Let $\mathcal{O}$ be the set in which observations live. For any privacy mechanism $S(o \mid x, y)$ that is $\epsilon$-differentially private, there exists a mechanism $S'(o' \mid \phi(x, y))$ that is also $\epsilon$-differentially private, and for any set $A \subseteq \mathcal{O}$, we have

$\mathbb{P}[o' \in A \mid x] = \mathbb{P}[o \in A \mid x]$, (8)

where $o' \sim S'(\cdot \mid \phi(x, y))$ and $o \sim S(\cdot \mid x, y)$.

In short, we can always construct $S'$ that only uses the sufficient statistics $\phi(x, y)$ but yields the same joint distribution over the pairs (x, o). Furthermore, $S'$ is at least as private as the original mechanism S. See Appendix A.1 for a proof. This motivates a focus exclusively on mechanisms that use the sufficient statistics, and in particular, we consider the following two structured randomized response mechanisms. Our schemes are both two-phase procedures that first binarize the sufficient statistics and then release a set of observations, inspired by Duchi et al.'s (2013) minimax-optimal approach to estimating a multinomial distribution.

For $t > 0$, let $q_t \stackrel{\mathrm{def}}{=} \frac{e^{t/2}}{1 + e^{t/2}}$. Assume each coordinate of the statistics $\phi$ lies in the interval $[0, c]$ for some positive scalar c. For $i \in \{1, \ldots, d\}$, draw $\bar o[i]$ as a Bernoulli variable with bias $\frac{1}{c}\phi(x, y)[i]$. Then:

(Coordinate release) Draw a coordinate $j \in \{1, \ldots, d\}$ from a distribution $p_{\text{cr}}$. Set $o_{\text{cr}} = \bar o[j]$ with probability $q_\epsilon$, otherwise $o_{\text{cr}} = 1 - \bar o[j]$. Release the pair $(j, o_{\text{cr}})$.

(Per-value $\phi$-RR) Denote by $\bar{\mathcal{O}}(x, y)$ the support of $\bar o$ given $x, y$, let $\delta^* \stackrel{\mathrm{def}}{=} \sup_{x, y,\, \bar o \in \bar{\mathcal{O}}(x, y)} \|\bar o\|_1$, and take any $\bar\delta \ge \delta^*$. For $j = 1, \ldots, d$, set $o_{\text{pv}}[j] = \bar o[j]$ with probability $q_{\epsilon/\bar\delta}$, otherwise $o_{\text{pv}}[j] = 1 - \bar o[j]$. Release the vector $o_{\text{pv}}$.

Both are $\epsilon$-differentially private (see Appendix A.1). For coordinate release, define the observation function

$\beta_{\text{cr}}(x, (j, o_{\text{cr}})) \stackrel{\mathrm{def}}{=} \frac{e_j}{p_{\text{cr}}(j)} \cdot \frac{o_{\text{cr}} - 1 + q_\epsilon}{2 q_\epsilon - 1} \cdot c$,

where $e_j$ denotes the j-th standard basis vector. For the per-value statistics scheme, define the observation function

$\beta_{\text{pv}}(x, o_{\text{pv}}) \stackrel{\mathrm{def}}{=} \frac{o_{\text{pv}} - (1 - q_{\epsilon/\bar\delta})\mathbf{1}}{2 q_{\epsilon/\bar\delta} - 1} \cdot c$. (9)

In either case, we have that $\mathbb{E}[\beta(x, o)] = \mu^*$, as required by (3) for $\hat\theta_{\text{mom}}$ to be consistent. The two schemes offer a tradeoff: when $\bar o$ is dense, coordinate release is advantageous, as our best norm bound $\bar\delta$ may be as large as the dimension d; although we reveal only a single coordinate at a time, we noise it with the lower-variance distribution $q_\epsilon$ rather than the $q_{\epsilon/d}$ noise of the per-value scheme.
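A sketch of the two structured mechanisms and their debiasing observation functions (again our own illustration, under the assumptions that $c = 1$, $p_{\text{cr}}$ is uniform, and a bound $\bar\delta$ is given):

```python
import numpy as np

rng = np.random.default_rng(0)

def q(t):
    # q_t = e^{t/2} / (1 + e^{t/2}): probability of keeping a bit unflipped
    return np.exp(t / 2) / (1 + np.exp(t / 2))

def binarize(phi_xy, c=1.0):
    """Phase 1: draw o_bar[i] ~ Bernoulli(phi(x, y)[i] / c), coordinates in [0, c]."""
    return (rng.random(phi_xy.shape) < phi_xy / c).astype(float)

def coordinate_release(o_bar, eps, d):
    """Release one coordinate j ~ uniform p_cr, flipped w.p. 1 - q_eps."""
    j = rng.integers(d)
    keep = rng.random() < q(eps)
    return j, (o_bar[j] if keep else 1 - o_bar[j])

def beta_cr(j, o_cr, eps, d, c=1.0):
    """Unbiased reconstruction: E[beta_cr] = phi(x, y) when p_cr(j) = 1/d."""
    out = np.zeros(d)
    out[j] = d * c * (o_cr - (1 - q(eps))) / (2 * q(eps) - 1)
    return out

def per_value_rr(o_bar, eps, delta_bar):
    """Flip each coordinate independently w.p. 1 - q_{eps/delta_bar}."""
    keep = rng.random(o_bar.shape) < q(eps / delta_bar)
    return np.where(keep, o_bar, 1 - o_bar)

def beta_pv(o_pv, eps, delta_bar, c=1.0):
    """Eq. (9): elementwise debiasing of the flipped vector."""
    qq = q(eps / delta_bar)
    return c * (o_pv - (1 - qq)) / (2 * qq - 1)

# Quick check of unbiasedness for the per-value scheme on binary statistics:
phi_xy = np.array([1.0, 0.0, 1.0, 0.0])   # ||phi_xy||_1 = 2, so take delta_bar = 2
est = np.mean([beta_pv(per_value_rr(binarize(phi_xy), 2.0, 2.0), 2.0, 2.0)
               for _ in range(50_000)], axis=0)
print(est)                                 # approximately phi_xy
```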
Meanwhile, per-value $\phi$-RR enjoys lower variance when $\bar o$ has low $\ell_1$ norm. The latter case arises, for instance, if $\phi$ is a sparse binary vector, as is common in structured prediction.

Summarizing, we have three randomized response schemes. Classical RR appeals only in unstructured problems with few outputs $\mathcal{Y}$. In the structured setting, we can move to the sufficient statistics $\phi$ by Proposition 1 and exploit their structure with either of two schemes, chosen based on our knowledge of the $\ell_1$-norm or sparsity of the statistics $\phi$.

4. Learning with lightweight annotations

For a sequence labeling task, e.g., part-of-speech (POS) tagging, it can be tedious to obtain fully-labeled input-output sequences for training. This motivates a line of work that attempts to learn from weaker forms of annotation (Mann & McCallum, 2008; Haghighi & Klein, 2006; Liang et al., 2009). We focus on region annotations, where an annotator is asked to examine only a particular subsequence of the input and count the number of occurrences of some label (e.g., nouns). The rationale is that it is cognitively easier for the annotator to focus on one label at a time rather than annotating from a large tag set, and physically easier to hit a single yes/no or counter button than to select precise locations, especially in mobile crowdsourcing interfaces (Vaish et al., 2014). See Table 1 for an example.

Table 1. Part-of-speech tagging with region annotations. An annotator is given a region (bold, in brackets) and asked to count the number of times particular tags (e.g., NN) occur.

  x:  The quick brown [fox jumps over the lazy dog]
  y:  DT  JJ    JJ    [NN  VBZ   IN   DT  JJ   NN]
  o:  #NN = 2

More formally, the supervision $S(o \mid y)$ is defined as follows. First, choose the starting position $j_{\text{start}}$ uniformly from $\{1, \ldots, L - w + 1\}$ and set the ending position $j_{\text{end}} = j_{\text{start}} + w - 1$, where w is a fixed window size; let $r = \{j_{\text{start}}, \ldots, j_{\text{end}}\}$ denote this region. Next, choose a subset of tags B uniformly from the tag set (e.g., B = {NN, DT}). From here, the observation o is generated deterministically: for each tag $b \in B$, the annotator counts the number of occurrences in the region, $N[b] = |\{j \in r : y[j] = b\}|$. The final observation is $o = (r, B, N)$.

In this setting, not only is the marginal likelihood non-convex; inference requires summing over the possible ways of realizing the counts, which is exponential in the window size w and the number of queried tags |B|.

Estimation. For our estimator to work, we make two assumptions:

1. The node potentials depend only on x[j]: $\phi_j(y[j], x, j) = f(x[j], y[j])$; and
2. Under the true conditional distribution, y[j] depends only on x[j]: $p^*(y[j] \mid x) = p^*(y[j] \mid x[j])$.

These are admittedly strong independence assumptions, similar to IBM Model 1 for word alignment (Brown et al., 1993) or the unordered translation model of Steinhardt & Liang (2015). Even though our model is fully factorized and lacks edge potentials, inference over $p_\theta(y \mid x, o)$ is expensive: conditioning on the indirect supervision o couples all of the y variables. This typically calls for the approximate inference techniques common to structured prediction. Steinhardt & Liang (2015) developed a relaxation to cope with this supervision, but it still requires approximate inference via sampling and non-convex optimization.

In contrast to the local privacy examples, the new challenge here is that the observation o does not provide enough information to evaluate a single node potential, even stochastically, so we cannot directly write $\mu^*$ in terms of functions of the observations. As a bridge, define the localized conditional distributions

$w^*(a, b) \stackrel{\mathrm{def}}{=} \mathbb{P}[y[j] = b \mid x[j] = a]$,

which by assumption 2 specify the entire conditional distribution.
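The annotation process itself is straightforward to simulate. The sketch below (illustrative code with a hypothetical five-tag set; positions are 0-indexed) generates an observation $o = (r, B, N)$ from a tagged sentence:

```python
import numpy as np

rng = np.random.default_rng(0)
TAGS = ["DT", "JJ", "NN", "VBZ", "IN"]   # hypothetical small tag set

def annotate(y, w, num_tags=2):
    """Region annotation, per Section 4: pick a window of size w uniformly,
    then count occurrences of a random subset of tags inside it."""
    L = len(y)
    j_start = rng.integers(L - w + 1)            # uniform starting position
    r = range(j_start, j_start + w)              # region of w consecutive positions
    B = rng.choice(TAGS, size=num_tags, replace=False)
    N = {b: sum(y[j] == b for j in r) for b in B}
    return r, B, N

y = ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"]
print(annotate(y, w=6))
```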
The sufficient statistics $\mu^*$ can be written in terms of $w^*$:

$\mu^* = \mathbb{E}\Big[\sum_{j=1}^{L} \sum_{b \in [K]} w^*(x[j], b)\, f(x[j], b)\Big]$. (10)

We now define constraints that relate the observations o to $w^*$. Recall that each observation o includes a region r, a tag subset B, and a vector of counts $N = \big[\sum_{j \in r} \mathbb{I}[y[j] = b]\big]_{b \in B}$, one for each tag b. For each input $x \in \mathcal{X}$ and tag b, we have the identity

$\mathbb{E}[N[b] \mid x, r] = \sum_{j \in r} w^*(x[j], b)$. (11)

While we do not observe the LHS, we do observe N[b], which is an unbiased estimate of the RHS of (11). We can therefore solve a regression problem with response N to recover a consistent estimate $\hat w$ of $w^*$:

$\hat w = \arg\min_{w} \sum_{i=1}^{n} \sum_{b \in B^{(i)}} \Big(\sum_{j \in r^{(i)}} w(x^{(i)}[j], b) - N^{(i)}[b]\Big)^2$. (12)

For instance, the example in Table 1 contributes the term $(\mathbb{P}[\text{NN} \mid \text{fox}] + \mathbb{P}[\text{NN} \mid \text{jumps}] + \cdots + \mathbb{P}[\text{NN} \mid \text{dog}] - 2)^2$. Finally, we plug $\hat w$ into (10) to obtain $\hat\mu$.
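Equation (12) is an ordinary least-squares problem once w is flattened into a vector. A minimal sketch (our own code; the data layout, e.g. `examples` as (sentence, region, tags, counts) tuples, is hypothetical):

```python
import numpy as np

def recover_w(examples, vocab, tags):
    """Least squares for eq. (12): each annotated (sentence, region, tag, count)
    contributes one linear equation sum_{j in r} w(x[j], b) = N[b].
    `vocab` must cover every word appearing in the examples."""
    widx = {(a, b): i for i, (a, b) in enumerate(
        (a, b) for a in vocab for b in tags)}
    rows, targets = [], []
    for x, r, B, N in examples:         # x: words; r: region indices; N: counts per tag
        for b in B:
            row = np.zeros(len(widx))
            for j in r:
                row[widx[(x[j], b)]] += 1.0
            rows.append(row)
            targets.append(N[b])
    A, y = np.stack(rows), np.array(targets)
    w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {ab: w_hat[i] for ab, i in widx.items()}
```

Since each $w(a, \cdot)$ is a conditional distribution, one could further project the solution onto the probability simplex, though the unconstrained least-squares estimate is already consistent.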
5. Asymptotic analysis

We have two estimators: maximum marginal likelihood ($\hat\theta_{\text{marg}}$), which is difficult to compute, requiring non-convex optimization and possibly intractable inference; and our moments-based estimator ($\hat\theta_{\text{mom}}$), which is easy to compute, requiring only the solution of a linear system and a convex optimization. In this section, we study and compare the statistical efficiency of $\hat\theta_{\text{marg}}$ and $\hat\theta_{\text{mom}}$. For simplicity, we focus on the unconditional setting where x is empty, and omit x throughout this section. We also assume our exponential family model is well-specified and that $\theta^*$ denotes the true parameters. All expectations are taken with respect to $y \sim p_{\theta^*}$.

Recall from (3) that $\mathbb{E}[\beta(o) \mid y] = \phi(y)$. We can therefore think of $\beta(o)$ as a best guess of $\phi(y)$. The following proposition provides the asymptotic variances of the estimators:

Proposition 2 (General asymptotic variances). Let $I \stackrel{\mathrm{def}}{=} \mathrm{Cov}[\phi(y)]$ be the Fisher information matrix. Then $\sqrt{n}(\hat\theta_{\text{marg}} - \theta^*) \xrightarrow{d} N(0, \Sigma_{\text{marg}})$ and $\sqrt{n}(\hat\theta_{\text{mom}} - \theta^*) \xrightarrow{d} N(0, \Sigma_{\text{mom}})$, where the asymptotic variances are

$\Sigma_{\text{marg}} = \big(I - \mathbb{E}[\mathrm{Cov}[\phi(y) \mid o]]\big)^{-1}$, (13)
$\Sigma_{\text{mom}} = I^{-1} + I^{-1}\, \mathbb{E}[\mathrm{Cov}[\beta(o) \mid y]]\, I^{-1}$. (14)

Let us compare the asymptotic variances of $\hat\theta_{\text{marg}}$ and $\hat\theta_{\text{mom}}$ to that of the fully-supervised maximum likelihood estimator $\hat\theta_{\text{full}}$, which has access to $\{(x^{(i)}, y^{(i)})\}$ and satisfies $\sqrt{n}(\hat\theta_{\text{full}} - \theta^*) \xrightarrow{d} N(0, I^{-1})$.

Examining the asymptotic variance of $\hat\theta_{\text{marg}}$ (13), we see that the loss in statistical efficiency with respect to maximum likelihood is the amount of variation in $\phi(y)$ not explained by o, $\mathrm{Cov}[\phi(y) \mid o]$. Consequently, if y is simply deterministic given o, then $\mathrm{Cov}[\phi(y) \mid o] = 0$, and $\hat\theta_{\text{marg}}$ achieves the statistically efficient asymptotic variance $I^{-1}$.

The story with $\hat\theta_{\text{mom}}$ is dual to that for the marginal likelihood estimator. Considering the second term in expression (14), we see that the loss of efficiency due to our observation model grows linearly in the variability of the observations $\beta(o)$ not explained by y. Thus, unlike $\hat\theta_{\text{marg}}$, even if y is deterministic given o (so o reveals full information about y), we do not recover the efficient covariance $I^{-1}$. As a trivial example, let $\phi(y) = y \in \{0, 1\}$ and the observation $o = [y, y + \xi]^\top$ for $\xi \sim N(0, 1)$, so that o contains a faithful copy of y, and let $\beta(o) = \frac{o[1] + o[2]}{2}$. Then $\mathbb{E}[\mathrm{Cov}[\beta(o) \mid y]] = \frac{1}{4}$, and the asymptotic relative efficiency of $\hat\theta_{\text{mom}}$ to $\hat\theta_{\text{marg}}$ is $\frac{1}{1 + I^{-1}/4}$. Roughly, $\hat\theta_{\text{marg}}$ integrates posterior information about y better than $\hat\theta_{\text{mom}}$ does.

Proof. To compute $\Sigma_{\text{marg}}$, we follow standard arguments in van der Vaart (1998). If $\ell(o; \theta)$ is the marginal log-likelihood, then a straightforward calculation yields $\nabla\ell(o; \theta^*) = \mathbb{E}[\phi(y) \mid o] - \mathbb{E}[\phi(y)]$. The asymptotic variance is the inverse of $\mathbb{E}[\nabla\ell(o; \theta^*)\, \nabla\ell(o; \theta^*)^\top] = \mathrm{Cov}[\mathbb{E}[\phi(y) \mid o]]$; applying the variance decomposition $I = \mathrm{Cov}[\mathbb{E}[\phi(y) \mid o]] + \mathbb{E}[\mathrm{Cov}[\phi(y) \mid o]]$ gives (13). To compute $\Sigma_{\text{mom}}$, recall that the moments-based estimator computes $\hat\mu = \hat{\mathbb{E}}[\beta(o)]$ and $\hat\theta = (\nabla A)^{-1}(\hat\mu)$. Apply the delta method, where $\nabla(\nabla A)^{-1}(\mu^*) = (\nabla^2 A(\theta^*))^{-1} = \mathrm{Cov}[\phi(y)]^{-1}$. Finally, decompose $\mathrm{Cov}[\beta(o)] = \mathrm{Cov}[\mathbb{E}[\beta(o) \mid y]] + \mathbb{E}[\mathrm{Cov}[\beta(o) \mid y]]$ and recognize that $\mathbb{E}[\beta(o) \mid y] = \phi(y)$ to obtain (14).

Randomized response. To obtain concrete intuition for Proposition 2, we specialize to the case where S is the randomized response mechanism (5). In this setting, $\beta(o) = \frac{1}{\gamma}\phi(o) - h$ for the constant vector $h = \frac{1-\gamma}{\gamma}\mathbb{E}[\phi(y')]$. Recall the supervision model: $z \sim \mathrm{Bernoulli}(\gamma)$; $o = y$ if $z = 1$, and $o = y' \sim u$ if $z = 0$.

Lemma 1 (Asymptotic variance under randomized response). Under the randomized response model of (5), the asymptotic variance of $\hat\theta_{\text{mom}}$ is

$\Sigma_{\text{mom}} = I^{-1} + I^{-1} H I^{-1}$, (15)

where $H = \frac{1 - \gamma}{\gamma^2}\, \mathrm{Cov}[\phi(y')] + \frac{1 - \gamma}{\gamma}\, \mathbb{E}\big[(\phi(y) - \mathbb{E}[\phi(y')])^{\otimes 2}\big]$.

The matrix H governs the loss of efficiency, which stems from two sources: (i) $\mathrm{Cov}[\phi(y')]$, the variance from sampling $y' \sim u$; and (ii) the variance from choosing between y and $y'$. If $y'$ and y have the same distribution, then $H = \frac{1 - \gamma^2}{\gamma^2} I$ and $\Sigma_{\text{mom}} = \gamma^{-2} I^{-1}$.

Proof. We decompose $\mathrm{Cov}[\beta(o) \mid y]$ as

$\mathbb{E}[\mathrm{Cov}[\beta(o) \mid y, z] \mid y] + \mathrm{Cov}[\mathbb{E}[\beta(o) \mid y, z] \mid y]$
$= \mathbb{E}\big[z \cdot 0 + (1 - z)\, \mathrm{Cov}[\tfrac{1}{\gamma}\phi(y')] \,\big|\, y\big] + \mathrm{Cov}\big[\tfrac{z}{\gamma}\phi(y) + \tfrac{1 - z}{\gamma}\mathbb{E}[\phi(y')] \,\big|\, y\big]$
$= \frac{1 - \gamma}{\gamma^2}\, \mathrm{Cov}[\phi(y')] + \frac{1 - \gamma}{\gamma}\, \big(\phi(y) - \mathbb{E}[\phi(y')]\big)^{\otimes 2}$,

where we used $\beta(o) = \frac{1}{\gamma}\phi(o) - h$; taking expectations over y yields H.

Figure 3. The efficiency of $\hat\theta_{\text{mom}}$ relative to $\hat\theta_{\text{marg}}$ as $\gamma$ varies, for weak (a) and strong (b) signals $\theta^*$.

An empirical plot. The Hájek-Le Cam convolution and local asymptotic minimax theorems give that $\hat\theta_{\text{marg}}$ is the most statistically efficient estimator. We now empirically study the efficiency of $\hat\theta_{\text{mom}}$ relative to $\hat\theta_{\text{marg}}$, where $\mathrm{Eff} \stackrel{\mathrm{def}}{=} d^{-1} \operatorname{tr}(\Sigma_{\text{marg}} \Sigma_{\text{mom}}^{-1})$, the average of the per-coordinate variances of $\hat\theta_{\text{marg}}$ relative to $\hat\theta_{\text{mom}}$. We continue to focus on randomized response in the unconditional case. To study the effect of $\gamma$, we consider the following probability model: we let $y \in \{1, 2, 3, 4\}$, define features $\phi(y) \in \mathbb{R}^2$, and set $p_\theta(y) \propto \exp(\theta^\top \phi(y))$. We set $\theta = [2, 0.1]^\top$ and $\theta = [5, 1]^\top$ to represent weak and strong signals, respectively (the latter is harder to estimate, as the Fisher information matrix is much smaller); when $\theta = 0$, the asymptotic variances are equal, $\Sigma_{\text{mom}} = \Sigma_{\text{marg}}$. In Figure 3, we see that the asymptotic efficiency of $\hat\theta_{\text{mom}}$ relative to $\hat\theta_{\text{marg}}$ decreases as $\gamma \to 0$. This is explained by the fact that, as we see in expression (13), the $\hat\theta_{\text{marg}}$ estimator leverages prior information about y based on $\theta^*$, while as $\gamma \to 0$, expression (15) is dominated by the $\frac{1-\gamma}{\gamma^2}\mathrm{Cov}[\phi(y')]$ term, where $y'$ is uniform. Moreover, as $\theta^*$ grows larger, the conditional covariance $\mathrm{Cov}[\phi(y) \mid o]$ is much smaller than the covariance $\mathrm{Cov}[\beta(o) \mid y]$, so that we expect $\Sigma_{\text{marg}} \ll \Sigma_{\text{mom}}$.
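The quantities in (13) and (15) are directly computable in this small unconditional example. The sketch below (our own code; the feature map `phi` is a hypothetical stand-in, since the paper's exact choice is not recoverable from the text) evaluates the relative efficiency Eff for a few values of $\gamma$:

```python
import numpy as np

phi = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])  # hypothetical phi(y), y in {1,...,4}
theta = np.array([2.0, 0.1])                              # "weak" signal

p = np.exp(phi @ theta); p /= p.sum()                     # p_theta(y) propto exp(theta . phi(y))
mean = p @ phi
I = phi.T @ np.diag(p) @ phi - np.outer(mean, mean)       # Fisher info = Cov[phi(y)]

def efficiency(gamma):
    m = len(p)
    u = np.full(m, 1 / m)                                 # uniform base distribution
    # Sigma_mom via eq. (15): H = (1-g)/g^2 Cov_u[phi] + (1-g)/g E_p[(phi - E_u[phi])^{x2}]
    mean_u = u @ phi
    cov_u = phi.T @ np.diag(u) @ phi - np.outer(mean_u, mean_u)
    dev = phi - mean_u
    mix = (dev[:, :, None] * dev[:, None, :] * p[:, None, None]).sum(0)
    H = (1 - gamma) / gamma**2 * cov_u + (1 - gamma) / gamma * mix
    S_mom = np.linalg.inv(I) + np.linalg.inv(I) @ H @ np.linalg.inv(I)
    # Sigma_marg via eq. (13), using the exact posterior p(y | o).
    S_oy = gamma * np.eye(m) + (1 - gamma) * u[:, None]   # S(o | y), rows indexed by o
    q = S_oy @ p                                          # marginal distribution of o
    post = S_oy * p / q[:, None]                          # p(y | o)
    Epost = post @ phi
    cov_post = sum(q[o] * (phi.T @ np.diag(post[o]) @ phi
                           - np.outer(Epost[o], Epost[o])) for o in range(m))
    S_marg = np.linalg.inv(I - cov_post)
    return np.trace(S_marg @ np.linalg.inv(S_mom)) / len(theta)

for g in [0.2, 0.5, 0.9]:
    print(g, efficiency(g))   # efficiency approaches 1 as gamma -> 1
```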
6. The geometry of two-step estimation

We now provide some geometric intuition about the differences between $\hat\theta_{\text{marg}}$ and $\hat\theta_{\text{mom}}$, establishing a connection between $\hat\theta_{\text{mom}}$ and the EM algorithm as a byproduct of our discussion. For concreteness, let $\mathcal{Y} = \{1, \ldots, m\}$ be a finite set and let $\mathcal{P}$ be the set of all distributions over $\mathcal{Y}$ (represented as m-dimensional vectors). Let $\mathcal{F} \stackrel{\mathrm{def}}{=} \{p_\theta : \theta \in \mathbb{R}^d\} \subseteq \mathcal{P}$ be a natural exponential family over $\mathcal{Y}$ with $p_\theta(y) \propto \exp(\theta^\top \phi(y))$. See Figure 4 for an example where $m = 3$ and $d = 1$. Note that, viewed as a subset of the space of distributions, $\mathcal{F}$ is a non-convex set.

Let $\mathcal{O} = \{1, \ldots, k\}$ be the set of observations. We can represent the supervision function $S(o \mid y)$ as a matrix $S \in \mathbb{R}^{k \times m}$. For $p \in \mathcal{P}$, we can express the marginal distribution over o as $q = Sp$. Let $\hat q = \frac{1}{n}\sum_{i=1}^{n} \delta_{o^{(i)}}$ be the empirical distribution over observations. The maximum marginal likelihood estimator can now be written succinctly as

$\hat\theta_{\text{marg}} = \arg\min_{\theta}\, \mathrm{KL}(\hat q \,\|\, S p_\theta)$. (16)

While the KL-divergence is convex in p, the non-convex constraint set $\mathcal{F}$ makes the problem difficult.

Figure 4. Visualization of the exponential family $\mathcal{F}$ and all distributions $\mathcal{P}$ over $\mathcal{Y} = \{1, 2, 3\}$; here $\mathcal{P}$ is the 3-simplex. The model features for $\mathcal{F}$ are $\phi(1) = 5$, $\phi(2) = 1$, $\phi(3) = 0$. The blue curve marks out the exponential family $\mathcal{F} = \{p_\theta : \theta \in \mathbb{R}\}$. Observations yield two moment equations (dotted green) whose intersection with the simplex pins down the data distribution.

Our moment-based estimator $\hat\theta_{\text{mom}}$ can be viewed as a relaxation, where we first optimize over the relaxed set $\mathcal{P}$ and then project onto the exponential family:

$\hat r = \arg\min_{p \in \mathcal{P}} D(\hat q, S p), \qquad \hat p = \arg\min_{p \in \mathcal{F}} \mathrm{KL}(\hat r \,\|\, p)$. (17)

The first step can be computed directly via $\hat r = S^\dagger \hat q$ if D is the squared Euclidean distance. If D is the KL-divergence, we can use EM (see the composite likelihood objective of Chaganty & Liang (2014)), which converges to the global optimum. The result is a single distribution $\hat r$ over y. The second step optimizes over $p_\theta$ via $\theta$, which is a convex optimization problem, resulting in $\hat p$ corresponding to $\hat\theta_{\text{mom}}$.

Figure 5. Log-marginal likelihood $\log \sum_{y=1}^{3} S(o \mid y)\, p_\theta(y)$, where the exponential family features are $\phi(1) = 2$, $\phi(2) = 1$, $\phi(3) = 0$. The model is well-specified with $S = [1 \ldots]$.

Computing $\hat\theta_{\text{marg}}$ generally requires solving a non-convex optimization problem (see Figure 5 for an example). When S has full column rank and the model is well-specified, $\hat\theta_{\text{mom}}$ is consistent: we have that $\hat r = \hat p = S^\dagger \hat q \to S^\dagger S p_{\theta^*} = p_{\theta^*}$. This means that eventually the KL projection in (17) is essentially an identity operation: we almost have $\hat r \in \mathcal{F}$ by the rank assumptions, making the problem easy. This conclusion depends strongly on the well-specifiedness of the supervision; indeed, if $q^* \ne Sp$ for every $p \in \mathcal{P}$, then $\|\hat\theta_{\text{marg}} - \hat\theta_{\text{mom}}\| \ge c > 0$ for a constant c, even as $n \to \infty$. We can relax the column rank assumption, however: S simply needs to contain enough information about the sufficient statistics. That is, if $\Phi = [\phi(1) \cdots \phi(m)] \in \mathbb{R}^{d \times m}$ is the matrix of sufficient statistics, we require that $\Phi = RS$ for some matrix R.

Deterministic supervision. When the supervision matrix S has full column rank, $\hat\theta_{\text{mom}}$ converges to $\theta^*$. There are certainly cases where $\hat\theta_{\text{marg}}$ is consistent but $\hat\theta_{\text{mom}}$ is not. What can we say about $\hat\theta_{\text{mom}}$ in this case? To obtain intuition, consider the case where the supervision is a deterministic function mapping y to o (region annotations are an example). In this case, every column of S is an indicator vector, and $S^\dagger = S^\top \mathrm{diag}(S\mathbf{1})^{-1} = S^\top (S S^\top)^{-1}$. Here, $S^\dagger$ distributes probability mass evenly across all the $y \in \mathcal{Y}$ that deterministically map to o. In this case, $\hat\theta_{\text{mom}}$ simply corresponds to running one iteration of EM on the marginal likelihood, initialized with the uniform distribution over y ($\theta = 0$): the E-step conditions on o and places uniform mass over consistent y, producing $\hat r$; the M-step optimizes $\theta$ based on $\hat r$.
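A compact numerical illustration of the relaxation (17), using the squared-Euclidean choice for the first step so that $\hat r = S^\dagger \hat q$ (our own sketch; the toy supervision matrix is hypothetical):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def two_step(S, q_hat, Phi):
    """Relaxation (17): r_hat = pinv(S) q_hat, then KL-project onto the
    exponential family with statistics matrix Phi (d x m)."""
    r_hat = np.linalg.pinv(S) @ q_hat          # step 1: linear system over the relaxed set P
    r_hat = np.clip(r_hat, 1e-12, None)        # guard: r_hat may leave the simplex in general
    r_hat /= r_hat.sum()

    def kl_to_family(theta):                   # KL(r_hat || p_theta), convex in theta
        logits = Phi.T @ theta
        logZ = logsumexp(logits)
        return -(r_hat @ logits - logZ)        # drop the theta-independent entropy term

    res = minimize(kl_to_family, np.zeros(Phi.shape[0]))
    return res.x

# Toy instance: m = 3 outcomes, scalar feature, deterministic 2-observation supervision.
Phi = np.array([[2.0, 1.0, 0.0]])              # features phi(1), phi(2), phi(3)
S = np.array([[1.0, 1.0, 0.0],                 # o = 1 iff y in {1, 2}
              [0.0, 0.0, 1.0]])                # o = 2 iff y = 3
q_hat = np.array([0.7, 0.3])                   # empirical distribution of observations
print(two_step(S, q_hat, Phi))
```

On this instance the first step spreads the observed mass 0.7 evenly over $y \in \{1, 2\}$, yielding $\hat r = [0.35, 0.35, 0.3]$, which matches the one-iteration-of-EM interpretation above.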
7. Experiments

Local privacy. Following Section 3, we consider locally private estimation of a structured model. We take linear regression as a simple such structured model: it corresponds to a pairwise random field over the inputs and the response. The sufficient statistics are edge features $\phi_{i,j}(x_i, x_j) = x_i x_j$ and $\phi_i(x_i, y) = x_i y$ for each $i, j \in [d]$. On the housing dataset, the supervision o is given under the per-value RR scheme. On the songs dataset, o is a dense vector, motivating the coordinate release scheme instead: we choose $i, j \in [d]$ at random; with probability 1/2 we reveal $x_i y$ and with probability 1/2 we reveal $[x_i^2, x_i x_j, x_j^2]$, with suitable noise as described in Section 3.2. Note that the noising mechanism privatizes both the input variables x and the response y.

Figure 6. $R^2$ coefficient for linear regression when estimating from privately revealed sufficient statistics, on two datasets.

Figure 6 visualizes the average (over 10 trials) $R^2$ coefficient of fit for linear regression on the test set [1], in response to varying the privacy parameter $\epsilon$ [2]. As expected, the efficiency degrades as the privacy constraint tightens, though for moderate values of $\epsilon$ the loss is not significant.

Lightweight annotations. We experiment with estimating a conditional model for part-of-speech (POS) sequence tagging from lightweight annotations [3]. Every example in the dataset reveals a sentence and the counts of all tags over a consecutive window. Following the modeling assumptions in Section 4, we use a CRF (per Section 2) with only node features: $\phi_j(y[j], x, j)[g, a, b] = \mathbb{I}[g(x[j]) = a,\, y[j] = b]$, where g is a function of the word (e.g., word identity, prefix, suffix, and word signature).

When the problem is fully supervised, we maximize the log-likelihood with stochastic gradient descent (SGD); in this case, estimation is convex and exact gradients can be computed tractably. Under count supervision, convexity of the marginal likelihood is not guaranteed. Although the model has no edge features, the indirect count supervision places a potential over the region in which counts are revealed (one enforcing that the tag sequence is compatible with the counts). This renders exact inference intractable, so we approximate it using beam search to compute stochastic gradients [4]. The moment-based estimator is unaffected by this issue, as it requires no inference and proceeds via a pair of convex minimization programs; we minimize both using SGD.

Figure 7 shows train and test accuracies as we make passes over the dataset. Typically, after sufficiently many passes, the marginal likelihood gains an advantage over the moment-based estimator. For small regions, we expect the beam search approximation to be accurate, and indeed the marginal likelihood estimator is dominant there. For larger regions, the moment-based estimator (i) achieves high accuracy early and (ii) dominates for several passes before the marginal likelihood estimator overtakes it. Altogether, the experiment highlights that the moment-based estimator is favorable in computationally constrained settings.

[1] The (uncentered) $R^2$ coefficient of parameters w in a linear regression with design X and labels Y is $1 - \|Xw - Y\|^2 / \|Y\|^2$.
[2] We use the housing (mlcomp.org/datasets/840) and songs (mlcomp.org/datasets/748) data from MLcomp.
[3] We used the Wall Street Journal portion of the Penn Treebank. Sections 0-21 comprise the training set and 22-24 the test set.
[4] The dataset has 45 tag values. We use a beam of size 500 after analytically marginalizing nodes outside the region.
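For reference, the fit metric from footnote [1] as a minimal helper (our own code; it assumes the "1 minus residual ratio" normalization of the uncentered $R^2$, consistent with the higher-is-better axis of Figure 6):

```python
import numpy as np

def r2_uncentered(X, w, Y):
    """Uncentered R^2 from footnote [1]: 1 - ||Xw - Y||^2 / ||Y||^2."""
    return 1.0 - np.sum((X @ w - Y) ** 2) / np.sum(Y ** 2)
```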
Figure 7. Train and test per-position accuracies for $\hat\theta_{\text{marg}}$ and $\hat\theta_{\text{mom}}$ on part-of-speech tagging, under various-sized regions of count annotations, as training passes are taken through the dataset. (Curves: fully supervised, and count annotations with window sizes 1, 10, 20, 30, and 40; train and test accuracy for each.)

8. Related work and discussion

This work was motivated by two use cases of indirect supervision: local privacy and cheap annotations. Each trades statistical accuracy for another resource: privacy or annotation cost.

Local privacy has its roots in classical randomized response for conducting surveys (Warner, 1965), which has been extended to the multivariate (Tamhane, 1981) and conditional (Matloff, 1984) settings. In the computer science community, differential privacy has emerged as a useful formalization of privacy (Dwork, 2006). We work with the stronger notion of local differential privacy (Evfimievski et al., 2004; Kasiviswanathan et al., 2011; Duchi et al., 2013). Our contribution here is two-fold. First, we bring local privacy to the graphical model setting, which provides an opportunity for the privacy mechanism to be sensitive to the model structure; while we believe our mechanisms are reasonable, designing optimal mechanisms in the structured case remains an open question. Second, we connect privacy with other forms of indirect supervision.

The second use case is learning from lightweight annotations, which has taken many forms in the literature. Multi-instance learning (Maron & Lozano-Pérez, 1998) is popular in computer vision, where it is natural to label the presence but not the location of objects (Babenko et al., 2009). In natural language processing, there has also been work on learning from structured outputs where, as in this work, only counts of labels are observed (Mann & McCallum, 2008; Liang et al., 2009). However, these works resort to likelihood-based approaches involving non-convex optimization and approximate inference, whereas in this work we show that linear algebra and convex optimization suffice under our modeling assumptions. Quadrianto et al. (2008) showed how to learn from label proportions of groups of examples, using a linear system technique similar to ours. However, they assume that the group is conditionally independent of the example given the label, which would not apply in our region-based annotation setup, since our regions contain arbitrarily correlated inputs and heterogeneous labels. In return, we do need to make the stronger assumption that each label y[j] depends only on a discrete x[j], so that the credit assignment can be done using a linear system. An open challenge is to allow for heterogeneity with complex inputs.

Indirect supervision arises more generally in latent-variable models, which appear in machine translation (Brown et al., 1993), semantic parsing (Liang et al., 2011), object detection (Quattoni et al., 2004), and other missing-data problems in statistics (Robins & Wang, 2000). The indirect supervision problems in this paper have additional structure: we have an unknown model $p_\theta$ and a known supervision function S. It is this structure that allows us to obtain computationally efficient method-of-moments procedures.
We started this work to see how much juice we could squeeze out of just linear moment equations, and the answer is more than we expected. Of course, for more general latent-variable models beyond linearly indirectly-supervised problems, we would need more powerful tools. In recent years, tensor factorization techniques have provided efficient methods for a wide class of latent-variable models (Hsu et al., 2012; Anandkumar et al., 2012; Hsu & Kakade, 2013; Anandkumar et al., 2013; Chaganty & Liang, 2013; Halpern & Sontag, 2013; Chaganty & Liang, 2014). One can leverage even more general polynomial-solving techniques to expand the set of models (Wang et al., 2015). In general, the method of moments allows us to leverage statistical structure to alleviate computational intractability, and we anticipate more future developments along these lines.

References

Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2012.

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. arXiv, 2013.

Babenko, B., Yang, M., and Belongie, S. Visual tracking with online multiple instance learning. In Computer Vision and Pattern Recognition (CVPR), pp. 983-990, 2009.

Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263-311, 1993.

Chaganty, A. and Liang, P. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), 2013.

Chaganty, A. and Liang, P. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning (ICML), 2014.

Chang, M., Ratinov, L., and Roth, D. Guiding semi-supervision with constraint-driven learning. In Association for Computational Linguistics (ACL), pp. 280-287, 2007.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1-38, 1977.

Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy and statistical minimax rates. In Foundations of Computer Science (FOCS), 2013.

Dwork, C. Differential privacy. In Automata, Languages and Programming, pp. 1-12, 2006.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pp. 265-284, 2006.

Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy preserving mining of association rules. Information Systems, 29(4):343-364, 2004.

Graça, J., Ganchev, K., and Taskar, B. Expectation maximization and posterior constraints. In Advances in Neural Information Processing Systems (NIPS), pp. 569-576, 2008.

Haghighi, A. and Klein, D. Prototype-driven learning for sequence models. In North American Association for Computational Linguistics (NAACL), pp. 320-327, 2006.

Halpern, Y. and Sontag, D. Unsupervised learning of noisy-or Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), 2013.

Hsu, D. and Kakade, S. M. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.

Hsu, D., Kakade, S. M., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Conference on Learning Theory (COLT), 2009.
Hsu, D., Kakade, S. M., and Liang, P. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems (NIPS), 2012.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? SIAM Journal on Computing, 40(3):793-826, 2011.

Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pp. 282-289, 2001.

Liang, P., Jordan, M. I., and Klein, D. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009.

Liang, P., Jordan, M. I., and Klein, D. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pp. 590-599, 2011.

Mann, G. and McCallum, A. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Human Language Technology and Association for Computational Linguistics (HLT/ACL), pp. 870-878, 2008.

Maron, O. and Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems (NIPS), pp. 570-576, 1998.

Matloff, N. S. Use of covariates in randomized response settings. Statistics & Probability Letters, 2(1):31-34, 1984.

Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71-110, 1894.

Quadrianto, N., Smola, A. J., Caetano, T. S., and Le, Q. V. Estimating labels from label proportions. In International Conference on Machine Learning (ICML), pp. 776-783, 2008.

Quattoni, A., Collins, M., and Darrell, T. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems (NIPS), 2004.

Robins, J. M. and Wang, N. Inference for imputation estimators. Biometrika, 87(1):113-124, 2000.

Steinhardt, J. and Liang, P. Learning with relaxed supervision. In Advances in Neural Information Processing Systems (NIPS), 2015.

Tamhane, A. C. Randomized response techniques for multiple sensitive attributes. Journal of the American Statistical Association (JASA), 76(376):916-923, 1981.

Vaish, R., Wyngarden, K., Chen, J., Cheung, B., and Bernstein, M. S. Twitch crowdsourcing: Crowd contributions in short bursts of time. In Conference on Human Factors in Computing Systems (CHI), pp. 3645-3654, 2014.

van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, 1998.

Wang, S., Chaganty, A., and Liang, P. Estimating mixture models via mixture of polynomials. In Advances in Neural Information Processing Systems (NIPS), 2015.

Warner, S. L. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association (JASA), 60(309):63-69, 1965.